Handling Chinese-Language Bibliographic Data
A North American Perspective
Hong Kong Library Association, August 22nd, 2003
Clément Arsenault, assistant professor
École de bibliothéconomie et des sciences de l’information
Université de Montréal
©2003, Clément Arsenault 2
Overview
Multilingual / Multiscript Information Systems
Transliteration, Transcription and Romanization
Romanization Systems for Chinese
Transliteration in Bibliographic Records
Word Division
Parsing Chinese Text
Word Division for Bibliographic Control
A Retrieval Experiment
Multilingual / Multiscript Info. Systems
Integrate several languages and/or several scripts
10 major scripts, used to write ~95% of all languages
• Japanese
  – Hiragana あいうえお…
  – Katakana アイウエオ…
• Korean 가각갂갃간…
• Chinese 甲乙丙丁…
• Roman abcdeéèêœ …
• Greek αβγδε …
• Cyrillic авгдеж …
• Hebrew אבגדה …
• Arabic ا ب ت ث ج ح …
• Indic (11) अआइईउऊ …
• Thai กขฃคฅฆ …
Multilingual / Multiscript Info. Systems
System contains records representing items in more than one language
System contains records that are, in total or in part, in more than one language
The system interface is in more than one language
• System prompts
• Command / query language
The system is able to display text in more than one script
The system allows the end user to build queries in more than one script
Multilingual / Multiscript Info. Systems
Non-Roman data in North American OPACs
[Diagram: Romanization vs. vernacular data in cataloguing and retrieval: stored? (yes/no), displayed? (yes/no), indexed? (yes/no)]
Chinese Language: Some Facts
Number of characters
• 9,353 in 1st century C.E.
• 47,043 in 1716 (康熙字典)
• ~60,000 in 1990 (漢語大字典)
Occurrence
• 1,000 characters: 90%
• 2,400 characters: 99%
• 3,800 characters: 99.9%
• 6,600 characters: 99.999%
Romanization Systems for Chinese
What is transliteration?
• Script conversion
  – Transliteration: script → script
  – Transcription: sound → script
What is Romanization?
• Converting a script to the Roman script
Romanizing Chinese script
• Only transcription is possible
Romanization Systems for Chinese
What sounds?
• Vast number of regionalects / dialects
• Standard is Mandarin (based on Beijing)
Example: 茶
• cha: Northern
• zo: Suzhou
• dzo: Wenzhou
• te: Xiamen (Amoy)
• tssa: Guangzhou (Canton)
Romanization Systems for Chinese
What sounds?
• Then how to render it… ?
Example: 春
chun, ch’un, tchun, tchwun, tchoun, tchounne...
Romanization Systems for Chinese
Historical overview
• Fanqie method (early 1st millennium)
  – 烃 = 土 + 丁 (tu + ding)
• Matteo Ricci & Father Nicolas Trigault (17th cent.)
• Hundreds of schemes developed since then
  – Mostly developed by Westerners
  – Wade (English), Wade-Giles (English/American)
  – EFEO (French)
  – Yale (American)
  – Lessing-Othmer (German)
  – …
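The fanqie example (土 + 丁) spells a reading by taking the initial of the first character and the final of the second. A minimal sketch of that idea follows, using toneless modern pinyin as a stand-in for the historical pronunciations; the `INITIALS` list and the function names are illustrative assumptions, not part of the original talk.

```python
# Pinyin initials, two-letter ones first so "zh", "ch", "sh" win over "z", "c", "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"

def fanqie(upper, lower):
    """Combine the initial of `upper` with the final of `lower`."""
    initial, _ = split_syllable(upper)
    _, final = split_syllable(lower)
    return initial + final

print(fanqie("tu", "ding"))  # ting
```

Traditional fanqie also takes the tone from the second character; tones are left out here to keep the sketch short.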
Romanization Systems for Chinese
Historical overview
• Systems developed by Chinese
  – Gwoyeu Romatzyh Pinin Faashyh (1928)
  – Beifangxua Latinxua Sin Wenz (1931)
  – Hanyu pinyin fang’an (1956)
Romanizing Chinese for bibliographic control in North America
• Wade-Giles (through October 2000)
• Pinyin (after October 2000)
Romanization Systems for Chinese
Wade-Giles vs Pinyin
Example: 唐宋全诗
• Wade-Giles: T‘ang2 Sung4 ch‘üan2 shih1
• Pinyin: Táng Sòng quán shī
Wade-Giles
• Used mostly in English-speaking countries
• Was used until 2000 at LC (and mainly in NA libraries)
• Rarely used in teaching anymore
• Heavy use of punctuation and diacritics
Pinyin
• Used internationally
• Used for many years in libraries in Europe and Australia
• Used for teaching
• Minimal use of punctuation and diacritics
Transliteration in Bibliographic Records
Is transliteration necessary / useful?
• Necessary for oral and written communications
All 川崎 models come fully equipped.
All Kawasaki models come fully equipped.
Transliteration in Bibliographic Records
Is there a need for Romanized fields in bibliographic records?
• In printed records?
• In electronic records?
A special case for Chinese
• Three major obstacles
  – Filing: difficult to browse Chinese characters
  – Data entry: users need to Romanize anyway
  – 25% of sources in Roman only (Anderson 1972)
Transliteration in Bibliographic Records
Filing Chinese characters
• Number of strokes
• Semantic roots
  – Then, number of strokes
• Based on shape
  – 4-corners method
• By sound
  – Romanization… (A–Z)
  – Simplest and fastest method
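Filing by sound reduces to an ordinary A–Z sort once each character has a Romanized reading. The sketch below assumes a small hypothetical lookup table (`READINGS`) built from readings that appear elsewhere in these slides; a real catalogue would need a full dictionary and a way to handle characters with several readings.

```python
# Hypothetical lookup table: character -> toneless pinyin reading.
READINGS = {"茶": "cha", "春": "chun", "林": "lin", "栗": "li", "粟": "su"}

def filing_key(title):
    """Build a sort key from the Romanized reading of each character."""
    return [READINGS[ch] for ch in title]

titles = ["粟", "茶", "林", "春", "栗"]
print(sorted(titles, key=filing_key))  # ['茶', '春', '栗', '林', '粟']
```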
Transliteration in Bibliographic Records
Data entry
• Keyboards (more than 700 methods)
  – Special keyboards
  – QWERTY or AZERTY keyboards
    – orthographic-based methods
    – phonetic-based methods
• Special devices
  – OCR
  – Pressure sensitive tablets
  – Voice recognition…
Word Division
Chinese is written without word delimiters
多接近大自然總是不錯的,因為人是從大自然而來的。
But Romanized Chinese could/should be…
Duo jie jin da zi ran zong shi bu cuo de, yin wei ren shi cong da zi ran er lai de.
Duo jiejin daziran zongshi bucuo de, yinwei ren shi cong daziran erlai de.
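One way to get from the monosyllabic form to the aggregated form is greedy longest-match against a word list. This is only a sketch under stated assumptions: the `LEXICON` below is a hypothetical six-word list chosen to cover the example sentence, and greedy matching is a common baseline technique, not necessarily what any cataloguing agency uses.

```python
# Hypothetical lexicon of polysyllabic words, written as joined toneless pinyin.
LEXICON = {"jiejin", "daziran", "zongshi", "bucuo", "yinwei", "erlai"}
MAX_WORD_SYLLABLES = 3

def aggregate(syllables):
    """Join syllables into words by greedy longest-match against LEXICON."""
    words, i = [], 0
    while i < len(syllables):
        # Try the longest candidate first; a single syllable always matches.
        for n in range(min(MAX_WORD_SYLLABLES, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in LEXICON:
                words.append(candidate)
                i += n
                break
    return " ".join(words)

mono = "duo jie jin da zi ran zong shi bu cuo de"
print(aggregate(mono.split()))  # duo jiejin daziran zongshi bucuo de
```

The result depends entirely on the lexicon, which is one reason consistent word division is hard to achieve.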
Word Division
Reasons for delimiting Romanized Chinese
• Syllabic structure is too simple for efficient retrieval
  – ~1300 single syllables (~400 base syllables)
  – “mā, má, mǎ, mà” indexed as “ma”
• Single syllables
  – High level of ambiguity (homophones)
  – Ambiguous 8 times out of 9
  – Readability is almost nil
• Joined syllables
  – Resolves ~95% of ambiguity cases (King 1983)
  – Greatly improves readability
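The collapse of “mā, má, mǎ, mà” into the single index term “ma” can be illustrated by stripping combining marks after Unicode decomposition. This is a sketch of one possible normalization, not a description of any particular library system; note that it would also turn “ü” into “u”.

```python
import unicodedata
from collections import defaultdict

def strip_tones(syllable):
    """Remove combining marks (tone marks, and also the diaeresis of ü)."""
    decomposed = unicodedata.normalize("NFD", syllable)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

index = defaultdict(list)
for syllable in ["mā", "má", "mǎ", "mà"]:
    index[strip_tones(syllable)].append(syllable)

print(dict(index))  # {'ma': ['mā', 'má', 'mǎ', 'mà']}
```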
Word Division
But, no consistent rules…
中國話
• Zhong guo hua
• Zhong-guo hua
• Zhongguo hua
• Zhongguohua
Parsing Text
What is a word?
• Visual word
• Semantic/syntactic word...
Often based on conventions
Not always consistent (Google hit counts, 4 Aug. 2003)
– earring (461,000) / ear ring (18,200)
– shoemaker (465,000) / shoe maker (21,100)
– bottleneck (419,000) / bottle neck (32,800)
– firefighter (687,000) / fire fighter (121,000)
– flowerpot (42,100) / flower pot (54,300)
Word Division for Bibliographic Control
1997: LC announces change to Pinyin
Use monosyllabic or polysyllabic transcription?
Monosyllabic division
• Consistent
• Increases recall
• Lowers precision
• Easier to convert from existing Wade-Giles
• Easier to generate Romanization from a string of Chinese characters
Polysyllabic division
• Difficult to be consistent
• Lack of established standard
• The proper format according to Hanyu pinyin fang’an (the PRC pinyin standard)
• Represents the nature of the language
• Improves readability when browsing
• Improves precision
• More effective in voice recognition / text-to-speech implementations
A Retrieval Experiment
Experiment designed to test the effect of syllable aggregation on retrieval
Part of a doctoral thesis at the University of Toronto
Statement of the Problem
Conversion to pinyin (1st Oct. 2000–1st Oct. 2001)
No inclusion of tones
Text division (syllable aggregation)
• Monosyllabic for common words (e.g., 东西 dong xi)
• Polysyllabic for proper nouns (e.g., 上海 Shanghai)
Consequences:
Two different methods used together
• Confusing!!!
Only ~400 index “terms” available for all common words
• Too few!!!
Statement of the Problem
Conversion from Wade-Giles to Pinyin
• Convert to monosyllabic?
• Convert to polysyllabic?
Potential impact on retrieval / browsing
Measure impact on retrieval
• Effectiveness (success in finding records)
• Efficiency (effort spent to find them, i.e., time)
Research Questions
Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item exact-title searches.
Determine if using polysyllabic pinyin entries, over monosyllabic pinyin entries, in bibliographic records improves retrieval effectiveness and efficiency in known-item keywords-in-title searches.
Research Questions
In other words
What is the effect of aggregation patterns on…
Six variables were defined
Six hypotheses

               Exact-title   Keywords
Efficiency     Q1            Q3
Effectiveness  Q2            Q4
Research Questions
Definitions
Exact-title search mode (with implied truncation)
• Request for “Gone with the wind”
• QUERY: “gone with the”
Keyword search mode
• Request for “Gone with the wind”
• QUERY: “wind” AND “gone”
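The two search modes can be sketched as a prefix match (implied truncation) and a Boolean AND over title words. The function names and the two extra titles are made up for illustration; the experimental system itself ran on ASP and a database, not on this code.

```python
def exact_title_search(titles, query):
    """Exact-title mode with implied truncation: the title must start with the query."""
    q = query.lower()
    return [t for t in titles if t.lower().startswith(q)]

def keyword_search(titles, *keywords):
    """Keyword mode: every keyword must occur as a word of the title (Boolean AND)."""
    wanted = {k.lower() for k in keywords}
    return [t for t in titles if wanted <= set(t.lower().split())]

# The second and third titles are hypothetical distractors.
titles = ["Gone with the wind", "Gone girl", "The wind in the willows"]
print(exact_title_search(titles, "gone with the"))  # ['Gone with the wind']
print(keyword_search(titles, "wind", "gone"))       # ['Gone with the wind']
```

Word order does not matter in keyword mode, which is why it tolerates an unknown first character better than exact-title searching.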
Hypotheses
Effect of using polysyllabic transcription over monosyllabic
Predictions
[Table: predicted effect on efficiency and effectiveness, for phrase and keyword searches]
Methodology
Retrieval task:
• Search 2 lists of 20 titles (in Chinese characters) using:
  – Wade-Giles Romanization (WG)
  – Pinyin-monosyllabic Romanization (mPY)
  – Pinyin-polysyllabic Romanization (pPY)
• Replicate using two search modes:
  – Exact-title searching (phrase matching)
  – Keyword searching
• Measure:
  – Time to complete task (efficiency)
  – Number of items/records found (effectiveness)
Methodology: sampling
Purposive sample of 30 students
• Graduate students
• Native speakers of Chinese
• Good working knowledge of Romanization
Each participant was given CAN$20
30 participants × 2 tasks = 60 trials
Methodology: design and procedures
My main statistical design was a 2 × 3 randomized factorial design with unbalanced proportional data. Participants were replicated over factor A.

Participants per cell (factor A: search mode; factor B: Romanization)
A \ B      WG    mPY   pPY
X-title    6     12    12
KW         6     12    12

Cell means
A \ B      WG    mPY   pPY
X-title    µ11   µ12   µ13
KW         µ21   µ22   µ23
Methodology: apparatus
20 titles × 2 lists = 40 titles
3 databases of ca. 50K records (RLIN db)
• WG / mPY / pPY
Databases running on Microsoft Access
Interface in HTML format accessed with Web browser
ASP links interface to database and records transaction logs
Methodology: apparatus
Titles to be searched, each followed by a blank for the ID number
1. 颤栗 / 蒋伯潜 ____ – ____
2. 盐山新志:河北省 / 汪美瑞 ____ – ____
3. 生死场 / 顾宝民 ____ – ____
4. 西藏那曲地区土地资源 / 施其明 ____ – ____
Methodology: data collection
Transaction Log Analysis (TLA)
Components of TLA
[Diagram: end-user, interface, database, logging program, logs]
Methodology: data collection
Interaction with external software components
[Diagram: Internet/WWW, Internet Information Server with ASP on Win NT, ASP scripts, HTML files, ADO, ODBC, SQL Server]
Methodology: data collection
[Diagram: TLA components (end-user, interface, database, logging program, logs) mapped onto the software components (Internet/WWW, Internet Information Server with ASP on Win NT, ASP scripts, HTML files, ADO, ODBC, SQL Server)]
Methodology: data collection
Generated logs
Statistical analysis
Pairwise comparisons (WG / mPY, WG / pPY, mPY / pPY) in each search mode (exact-title, keywords)
Efficiency measures
• Completion time
• Time/item found
• Expected Search Length
• Number of queries
Effectiveness measures
• Success rate
• Success rate per query
[Table: results of the pairwise comparisons for each measure]
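Several of the measures named above can be derived from a per-trial summary of the transaction log. The sketch below is an assumption about how they might be computed: the `Trial` fields, the formulas and the sample numbers are all hypothetical and are not taken from the thesis, and Expected Search Length is left out because its definition is not given here.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    seconds: float     # completion time for the whole task
    queries: int       # number of queries submitted
    items_found: int   # titles successfully retrieved
    items_sought: int  # titles on the list (20 per list in the experiment)

def measures(t):
    """Derive efficiency and effectiveness figures from one trial (assumed formulas)."""
    return {
        "completion_time": t.seconds,
        "time_per_item_found": t.seconds / t.items_found if t.items_found else float("inf"),
        "number_of_queries": t.queries,
        "success_rate": t.items_found / t.items_sought,
        "success_rate_per_query": t.items_found / t.queries if t.queries else 0.0,
    }

# Made-up numbers, for illustration only.
trial = Trial(seconds=900.0, queries=30, items_found=18, items_sought=20)
for name, value in measures(trial).items():
    print(name, value)
```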
Results
Aggregation improves search efficiency for title searches
Keyword search is especially influenced by aggregation
• Keyword search is especially important for Chinese titles since it is not unusual that the pronunciation of one of the first characters in the title is unknown

               X-title   Keywords
Efficiency     Q1        Q3
Effectiveness  Q2        Q4
Results
Using mono- and polysyllabic aggregation concurrently is a great source of confusion to end-users
Retrieval with Romanization works relatively well
Success rates for known-item searches vary between 72% and 91% depending on the Romanization system used
Results
However, for a non-negligible proportion of end users, Romanization-based retrieval poses real problems
Around 25% of the participants made between 50 and 80 Romanization errors during the retrieval task
[Histogram: number of participants (0–7) by number of errors, in bins of five from 0–4 to 80–84]
Results
Cause of errors
Aggregation
• A1: Two unlinked units were linked (e.g., dong xi / dongxi)
• A2: One linked unit was unlinked (e.g., Shanghai / Shang hai)
Romanization
• R1: Character was misread (e.g., 粟 su / 栗 li)
• R2: Romanization was misspelled (e.g., 林 was written ling instead of lin)
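The A1/A2 distinction can be checked mechanically by comparing a query with its target after removing spaces. The following is a rough sketch, not the coding procedure used in the study: the function name is hypothetical, and telling R1 from R2 would need the Chinese characters, so both are reported simply as "R".

```python
def classify_error(query, target):
    """Return None, 'A1', 'A2', 'A1+A2' or 'R' for a Romanized query versus its target."""
    if query == target:
        return None
    if query.replace(" ", "") == target.replace(" ", ""):
        # Same letters, different spacing: an aggregation error.
        if query.count(" ") < target.count(" "):
            return "A1"  # unlinked units were linked
        if query.count(" ") > target.count(" "):
            return "A2"  # a linked unit was unlinked
        return "A1+A2"   # same number of spaces, placed differently
    return "R"           # letters differ: a Romanization error (R1 or R2)

print(classify_error("dongxi", "dong xi"))      # A1
print(classify_error("Shang hai", "Shanghai"))  # A2
print(classify_error("ling", "lin"))            # R
```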
Results
Aggregation and Romanization errors
[Bar chart: aggregation errors and Romanization errors (average per query) for WG, mPY and pPY]
Results
Wade-Giles notation, more “forgiving”

Wade-Giles                  Pinyin
chen / ch’en                chen / zhen
chi / ch’i                  ji / qi
chu / ch’u / chü / ch’ü     zhu / chu / ju / qu
…                           …
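One reading of “forgiving” is that Wade-Giles distinguishes these syllables only by apostrophes and diacritics, so an index that ignores such marks lets a hesitant searcher still hit the right term, whereas pinyin uses different letters. The sketch below illustrates that reading under the assumption that the index drops apostrophes and combining marks; the pairings in `WG_TO_PINYIN` are the standard correspondences for the syllables in the table.

```python
import re
import unicodedata

# Wade-Giles syllable -> pinyin equivalent, for the syllables shown in the table.
WG_TO_PINYIN = {
    "chen": "zhen", "ch'en": "chen",
    "chi": "ji", "ch'i": "qi",
    "chu": "zhu", "ch'u": "chu", "chü": "ju", "ch'ü": "qu",
}

def index_key(syllable):
    """Drop apostrophes and diacritics (assumed behaviour of a mark-insensitive index)."""
    no_apostrophes = re.sub("['’‘]", "", syllable)
    decomposed = unicodedata.normalize("NFD", no_apostrophes)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

wg_keys = {index_key(s) for s in WG_TO_PINYIN}
py_keys = {index_key(s) for s in WG_TO_PINYIN.values()}
print(sorted(wg_keys))  # ['chen', 'chi', 'chu']
print(len(py_keys))     # 8
```

Eight Wade-Giles syllables fall onto three index keys, while the eight pinyin syllables stay distinct, so a wrong guess in pinyin misses the record.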
Results
Analysis of Romanization errors
Romanization
• R1: character misread (e.g., 粟 su / 栗 li)
• R2: Romanization misspelled
  – chen / cheng (dental nasal vs. velar nasal)
  – cu / zu (voiced fricatives vs. unvoiced fricatives)
  – hu / fu (glottal fricative vs. labiodental fricative)
  – la / na (alveolar lateral vs. alveolar nasal)
Results
Phonetic confusion
[Bar chart: errors (average per query, scale 0 to 0.2) for fricatives, nasals and other]
Conclusions
1. In KW mode, polysyllabic entries help improve precision
2. More aggregation errors in polysyllabic, but overall not overwhelming
3. Dual aggregation format is confusing to end-users (important source of error)
4. Still relatively high proportion of errors caused by confusion in Romanization
Further research
Project #1
Using vernacular script for retrieval
• What model???
[Diagram: Input (Romanization, other input method), Query (Romanization, 汉字), Data (Romanization, 汉字)]
Further research
Project #2
Using XML to encode non-Roman bibliographic data
• What is the viability of XML as a conversion format for bibliographic records containing non-Roman data?
• How can we use existing conversion schemas, for instance those developed at LC?
• Does XML offer the required flexibility for publishing non-Roman data on the Web with enhanced retrieval capabilities?
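As a rough idea of what an XML record carrying both vernacular and Romanized data could look like, the sketch below builds one with Python's standard `xml.etree.ElementTree`. The element and attribute names (`record`, `title`, `script`) are invented for illustration and do not follow any LC schema; `Hani` and `Latn` are ISO 15924 script codes.

```python
import xml.etree.ElementTree as ET

def build_record(vernacular, romanized):
    """Build a record holding the same title in Chinese script and in Romanization."""
    record = ET.Element("record")
    for script, text in (("Hani", vernacular), ("Latn", romanized)):
        title = ET.SubElement(record, "title", {"script": script})
        title.text = text
    return record

record = build_record("生死场", "Sheng si chang")
xml_bytes = ET.tostring(record, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Keeping both forms side by side in one record would let a retrieval tool index either script without a separate conversion step.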
Further research
Benefits
• Integration of resources created under a decentralized environment
• Creation of specialized retrieval tools adapted to the specific nature of the data
• Increased visibility for resources in non-Roman scripts
Questions
Thank you!