The Syllables in the Haystack: Technical Challenges of Non-Chinese in a Wade-Giles-to-Pinyin Conversion
Gail Thornburg is Consulting Software Engineer at OCLC Online Computer Library Center, Dublin, Ohio.
This paper describes the technical challenges of developing software to convert Wade-Giles to Pinyin in bibliographic records that are not in Chinese.
The Chinese language differs from the alphabetical languages to which most Westerners are accustomed. Representing items in such a language in a bibliographic database that employs principally roman script requires some form of conversion of the original to alphabetic characters, possibly with diacritics. Systems of such transliteration for Chinese date from at least 1605, but the one prevalent in the United States over the last hundred years or so is the Wade-Giles (WG) system.
Recently the Library of Congress (LC) decided to discontinue use of WG and adopt the newer Pinyin form of transliteration, introduced by the People’s Republic of China in the late 1950s. This meant conversion of Chinese records in the Online Computer Library Center (OCLC) authority file and OCLC bibliographic file to Pinyin.
This evolved into a consortial effort among LC, Research Libraries Group (RLG), and OCLC, an effort extending over three years. Earlier efforts by the OCLC Office of Research have been reported elsewhere. 
Once requests for comments and discussions with key libraries had taken place, there were major parts to the conversion effort to plan:
- LC conversion of Chinese authorities by OCLC, scheduled to take place not later than October 1, 2000, “Day One”
- Conversion of LC bibliographic records by RLG
- Bibliographic records conversion by OCLC and RLG of their respective union catalogs
- Conversion by OCLC of the non-Chinese records containing Chinese text, and later by RLG of similar records in their databases
- Conversion efforts by OCLC and RLG of records of institutions from WG to Pinyin
Development of the Specifications
These were developed cooperatively as the project progressed. The specifications can be seen at the LC Web site.  Some general points to keep in mind about the conversion:
- Only fields/subfields specified by LC in the specs were to be converted.
- Conversions made heavy use of dictionary lookups, not only for conversion of WG syllables to Pinyin counterparts, but also for phrase matching as in place names. The conversion sequences, which were directions for specific types of conversion such as geographic place names or Taiwan names, dealt with special types of translations. The dictionaries for these sequences were generally organized in the form of longest to shortest entries, to allow the most complete phrase matching.
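The longest-to-shortest organization can be sketched as follows. The dictionary entries here are illustrative examples taken from the place names discussed in this paper, not the project's actual tables:

```python
# Sketch of longest-first phrase matching, as used by the place-name
# conversion sequences. Entries are illustrative, not the real tables.
PLACE_DICT = {
    "Hsiao-shan shih": "Xiaoshan",
    "Chekiang Province": "Zhejiang Sheng",
    "Chin-chou shih": "Jinzhou",
    "Liaoning Province": "Liaoning Sheng",
}

def convert_phrases(text):
    """Try the longest WG phrases first, so the most complete
    phrase match wins before any shorter fragment of it could."""
    for wg, py in sorted(PLACE_DICT.items(), key=lambda e: -len(e[0])):
        text = text.replace(wg, py)
    return text

print(convert_phrases("Hsiao-shan shih (Chekiang Province, China)"))
# prints: Xiaoshan (Zhejiang Sheng, China)
```

Sorting by descending length is what prevents a short entry from consuming the front of a longer phrase before the complete match is tried.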
The Standard Dictionary (STD) was the complete list of the more than four hundred WG syllables and their Pinyin forms. This was searched after all other conversion sequences had a chance to do special matching.
Anyone familiar with WG and Pinyin romanization schemes knows that while diacritics and initial letters are usually enough to signal WG to the human eye, in fact the Pinyin and WG schemes have considerable overlap. In some cases these syllables are uniquely romanized for WG and Pinyin; in other cases the same syllables are romanized in just the same way; and in yet others, syllables spelled the same way in WG and Pinyin represent different sounds in each scheme. In testing early conversions, it quickly became evident that any automated conversion scheme needed to distinguish WG that could only be WG from WG that could also be Pinyin, or could be a common match that could be either.
As if this were not complex enough, it became apparent that the overlap of WG Chinese with other languages could lead to erroneous conversion of other languages to Pinyin. Systems of safeguards were developed.
One safeguard in the specification broke down the STD into four subdictionaries: Unique WG, Unique Pinyin, Same syllable in both, and Common (same spelling but different sound).
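A minimal sketch of how the STD might be split into those four subdictionaries follows; the syllables shown are a tiny illustrative sample, not the real list of four-hundred-plus entries:

```python
# Sketch of deriving the four subdictionaries from the Standard
# Dictionary (STD). Syllable pairs below are illustrative samples.
STD = {"hsiao": "xiao", "chang": "zhang", "an": "an", "pan": "ban"}
PINYIN_SYLLABLES = {"xiao", "zhang", "an", "ban", "chang", "pan"}

unique_wg, same, common = set(), set(), set()
for wg, py in STD.items():
    if wg not in PINYIN_SYLLABLES:
        unique_wg.add(wg)   # spelling occurs only in WG
    elif wg == py:
        same.add(wg)        # same spelling, same syllable in both schemes
    else:
        common.add(wg)      # valid spelling in both, but converts differently
unique_pinyin = PINYIN_SYLLABLES - set(STD)  # spellings that can only be Pinyin
```

Note how "chang" lands in Common: it is a legitimate Pinyin syllable, yet as WG it converts to "zhang", which is exactly the ambiguity the Mixed Text sequence had to manage.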
At LC this meant a new conversion sequence called “Mixed Text” to indicate what actions should be taken if troublesome mixes of the four categories of syllables were encountered. The Conversion Sequence Mixed Text attempted to identify cases where WG and other text could be discriminated safely, and whether these cases should be flagged for manual review, converted, or skipped. This was implemented at OCLC largely through what came to be called IsWadeGiles testing. This gatekeeper function is discussed further below.
In addition, special requirements were described in the evolving specifications, for cases in which, due to Board of Geographic Names (BGN) requirements, Taiwan place names were to be excluded from conversion.
The Taiwan conversion sequences started as a short list of examples of what not to convert. Prototype software was rapidly developed to do a sort of learning by example and to elucidate what the requirements of the spec needed to be. LC then reviewed a mini-test of sample conversions of fields likely to contain Taiwan place names and wrote the conversion sequence itself based on the behavior of the software.
Other conversion challenges included the interpretation of generic terms for jurisdictions implemented in the G conversion sequences (G3, G2, G1). Two examples are Feng-hsin hsien to Fengxin Xian, or Ying-hsien (Qualifier) to Ying Xian (Qualifier). The idea was that different types of place names could be predicted to occur generally in specific fields and not in others, and that the software should apply the G conversion sequences only in the former. The challenge was that the generic terms did not necessarily represent place names, so care was needed not to run the sequences where jurisdiction names would not be expected to occur. The software being developed did not, of course, know Chinese.
With testing came the recognition that overlap of syllables between WG and Pinyin was only one problem in discriminating WG safe from WG risky conversions. Even if a record was selected as Chinese, the records being converted usually contained strings of text in English and other languages.
In the realm of non-Chinese interspersed with Chinese in a subfield, even abbreviations such as Pa. and Jan. could be misinterpreted as Chinese if special screening was not done to avoid converting such English strings in the middle of Chinese subfields. The earliest of the tests featured software that readily converted English, Russian, Italian—and even Pinyin—to Pinyin (not a good outcome).
Too-common syllables were one of the key pitfalls. It was soon realized that when the software analyzed a subfield for conversion, it was critical for the software to know if the WG match was a case such as a, an, to, no, Jan, Ka, Jun, lung, sung, I, lo, la, le, so, sun, Juan. These might well be WG syllables, but might also be English, French, Italian, Spanish, or other languages. Converting French to Pinyin is not part of the spec. Nor is converting a subfield from
A concordance to: Yen tzu ch’un ch’iu
A concordance duo: Yan zi chun qiu
A little English-to-Pinyin conversion goes a long way. Soon, revisions to the spec directed blocking the conversion of “to” in certain subfields, but the software also had to try to guess whether it was the English “to” it had blocked, or the Chinese “to” which should be unblocked and converted to Pinyin.
Software Design Challenges
Early in the project, it occurred to the development team that the loosely defined specifications were going to be an issue. Early teleconferences with LC, RLG, and OCLC led to the suspicion that minds were going to be changed and rechanged, and compromises made. Moreover, both time and staff were limited. Yet it was necessary to expend the time to come up with an easy-to-understand flexible record structure that could be parsed by all the modules.
In a nod to realism, an early design decision was made that flew in the face of conventional software design. A module would be constructed for each conversion sequence in the spec and the same software run wherever indicated. At the same time, the field/subfield modules would be treated as unique. In this way enough granularity could be retained to allow changes in the order in which conversion sequences were run, the levels of safety checks required, and other very specific considerations.
This seemed to make sense as soon as it was realized that even the indicator values for certain fields determined varying courses of processing. This resulted in a huge number of small subfield modules that duplicated a lot but were easy for even the most junior of the development team to change in a hurry. Programs were developed to run mass compiles to keep all the pieces and parts in sync. Changes to the spec for the authorities and later bibliographic records occurred to the very eleventh hour of the development and testing process.
First Stage—Authorities Software
At OCLC, the first deadline was the approval of the authorities software. Since this was developed first it was the most prototypical. Note that the conversion deadline had been fixed at October 1, 2000. A moratorium on Chinese cataloging by libraries had to be imposed for the interval of the conversion. The authorities records needed to be converted in advance of the bibliographic records. In cases where it was not practical or even theoretically possible to ensure a perfect conversion, the outputs of the authorities conversion were organized in separate groups to aid in staged evaluation of riskier conversions such as the G conversion sequences. For bibliographic records the specs did include a text subfield, the 987$f, to which risk flags were written as was shown useful in testing. For the authorities records, there was no such provision for a special text field, so the records could only be isolated by grouping the outputs of the conversion. The software would flag a given subfield as requiring manual review and leave such subfields alone, but also would (by this sorting) flag riskier cases for later checking, even though the conversion was considered successful.
The Sequences of Conversions
The first conversion modules written dealt with place names (the C1/C2/C3 sequences, the G1/G2/G3 sequences). These handled conversions of phrases such as Hsiao-shan shih (Chekiang Province, China) to Xiaoshan (Zhejiang Sheng, China) or Chin-chou shih (Liaoning Province) to Jinzhou (Liaoning Sheng).
The dictionaries were searched in order by longest to shortest phrase, the substitution text was replaced, and the subfield passed on to the next conversion routine. Each program had to maintain and pass along a shadow subfield that indicated which portions of the text had already been converted. Bear in mind, except for the IsWadeGiles gatekeeper, none of the individual conversion routines were aware of the overlap areas of WG and Pinyin, so it was necessary to avoid possible reconversion by later conversion sequences.
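The shadow-subfield idea can be sketched roughly as follows. The WG pair used here (aspirated ch'ang, which converts to Pinyin "chang", versus unaspirated chang, which converts to "zhang") is an illustrative case where, without the mask, a later sequence would reconvert already-converted Pinyin:

```python
# Sketch of a "shadow subfield": a parallel mask marking which
# character positions have already been converted, so later
# conversion sequences never reconvert the same span.
def apply_phrase(text, mask, wg, py):
    """Replace wg with py only where the mask shows unconverted text."""
    i = text.find(wg)
    while i != -1:
        if not any(mask[i:i + len(wg)]):   # span still untouched?
            text = text[:i] + py + text[i + len(wg):]
            mask = mask[:i] + [True] * len(py) + mask[i + len(wg):]
            i = text.find(wg, i + len(py))
        else:
            i = text.find(wg, i + 1)       # already converted; skip it
    return text, mask

text = "ch'ang an"
mask = [False] * len(text)
text, mask = apply_phrase(text, mask, "ch'ang", "chang")
# Without the mask, this second pass would wrongly turn the freshly
# produced Pinyin "chang" into "zhang"; the mask blocks it.
text, mask = apply_phrase(text, mask, "chang", "zhang")
print(text)  # prints: chang an
```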
Phrases versus Syllables
What worked well enough for the phrase matching conversion routines was inadequate to process a subfield against the STD. This sequence needed to keep track of potentially hundreds of individual words and syllables and the surrounding punctuation and spacing. It was also necessary to maintain the original spacing and punctuation of the subfield except as explicitly instructed by the spec.
To organize this information, a table-like structure was implemented to represent the contents of the subfield, syllable by syllable, and to store the punctuation before and after each word or syllable. Each row of this table was one word or syllable found; different columns represented different types of matches made. The structure also stored the before-and-after lengths of the WG/Pinyin forms, or a zero length to indicate no match. In some cases the punctuation between syllables would be retained or replaced, and spacing might change, depending on the particular syllable and its neighbors.
So, while the subfield or its major parts were the general unit of scrutiny, some conversion modules needed to focus on phrase matches and be blind to the additional contents. Other modules needed to tokenize—break down—the whole subfield into its smallest component pieces and count every space and punctuation mark, all the while turning a selectively blind eye to things that didn’t always matter. Punctuation such as parentheses might matter in matching to the C sequence dictionaries, but not for the STD.
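A rough sketch of such a token table, assuming a simplified regular-expression tokenizer (the real software's tokenization rules were more involved):

```python
import re

# Sketch of the table-like subfield structure: one row per word or
# syllable, keeping the punctuation and spacing around each token so
# the subfield can be reassembled exactly after conversion.
def tokenize(subfield):
    rows = []
    # WG apostrophes (aspiration marks) are part of the syllable, so
    # they stay inside the token rather than counting as punctuation.
    for m in re.finditer(r"([^\w']*)([\w']+)([^\w']*)", subfield):
        rows.append({"before": m.group(1), "token": m.group(2),
                     "after": m.group(3), "match": None})  # match filled later
    return rows

rows = tokenize("Yen tzu ch'un ch'iu.")
```

Because each row carries its surrounding punctuation, concatenating before + token + after across all rows reproduces the original subfield byte for byte, which is what made lossless reassembly possible.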
The Segue from Chinese Authorities to Chinese Bibliographic Records
It was initially planned to use the same conversion software on the fields converted in bibliographic records as was run in authorities records; problems were soon encountered in this approach. One was scope: the size of fields in bibliographic records quickly exceeded what had been reasonable bounds in authorities records. For a time there was a processing tension between what was big enough to encompass even monster bibliographic records with hundreds of subfields and subparts of subfields, and what virtual meta-construct might grow so big in running memory that the software would be too big to run at all. One size had to fit all, and it wasn’t a small. The software was put on periodic diets, but the development team had to lie in wait and catch it bingeing one interesting evening, to track down the problem.
Another issue in moving to the bibliographic world related to the data. Many rules and assumptions of the software that worked reasonably well in the disciplined environment of authority records were bent or broken in the larger world. The test set furnished by LC included records with notes fields with such irregularities that one wondered if the person entering them had a broken finger or two.
This quickly revealed interesting gaps or levels of “trust” in the authorities-developed software that needed to get very mistrustful in the bibliographic records environment. The conversion sequence for STD was rewritten to tighten up the handling of widely varying punctuation and other practices.
In reviewing CONSER records, problem patterns were noticed. For instance, the usual requirement of manual review for subfields solely of Same/Common syllables could lead to needlessly high review rates for certain subfields. For example, a subfield might consist of
$aTi 1 pan..
$cmin kuo 66 (1979)..
The first has two Common syllables and the second has one Same syllable and one Common. In cases where the only WG in the subfield is all Same and Common, the subfield is flagged for review because the software cannot possibly know whether it is really WG or already Pinyin. Yet, in cases like those previously mentioned it seemed safe enough to go ahead and convert the patterns, and save human reviewers needless work. Several examples of these special-case instances were developed in the course of the bibliographic records conversion testing. The challenge was to find a way to describe their occurrence narrowly enough to apply to all subfields, since the IsWadeGiles module making the decision and the conversion sequence modules generally did not know which subfield they were processing.
The first question many asked of the team was “Why would you want to convert non-Chinese?” Indeed a lot of effort had been devoted early on to avoiding that outcome. Still, it was recognized that many records whose language code was not Chinese did in fact contain Chinese text that would be desirable to convert in the course of the project.
About the time it was recognized that separate conversion software would be needed for non-Chinese records, the testing of non-Chinese conversions was moved to a phase after the completion/approval of the Chinese conversion software. This conversion software was to be applied to the Chinese records in the OCLC Online Union Catalog (WorldCat).
At about the time approval was achieved, it was also recognized that, for WorldCat conversions, it would be useful to sort out even those Chinese language records that contained Japanese or Korean, or contained so many languages (more than four language codes in the 041 fields) as to need extra checks. These were screened from the Chinese conversion effort, to be picked up later.
Selection of non-Chinese records to try to convert was the initial issue. Part of the specification developed with LC identified the selection process and convertibility testing that would be used to identify good candidate records for conversion. The first non-Chinese test set submitted to LC consisted solely of a selection/rejection set, to determine if the records identified by the software appeared to be reasonable choices and if those omitted seemed correct. The challenge was to find as many as possible but to throw a lot back in convertibility testing. This paralleled the authorities selection process to a degree, in that authority records have no language code.
The specific criteria for selection are detailed in the specification for non-Chinese posted on LC’s Web site. In general, records with no 987 field (meaning the record had not already been converted, by software or by a human cataloger) were examined. In addition to records that would have been excluded from the Chinese conversion as noted above, the software looked for records with non-Chinese language codes whose 041 tag, if present, contained “chi” in one subfield. Also scrutinized were geographic area codes for countries that seemed likely, the presence of Chinese/Japanese/Korean (CJK) diacritics in the record, and even the word “Chinese” in a 500 or 546 field. These criteria were used to generate a preselected set, intended to be broad. The software also looked for music records, which have no language code.
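The first-phase selection might be sketched as a predicate over a simplified record structure; the dictionary-based record below is a stand-in for illustration, not real MARC handling:

```python
# Sketch of the first-phase (preselection) predicate for non-Chinese
# records. A record here is a plain dict keyed by MARC tag; the real
# software of course worked on full MARC structures.
def preselect(record):
    """Broad first cut: is this non-Chinese record worth examining?"""
    fields = record.get("fields", {})
    if "987" in fields:
        return False                         # already converted once
    if "chi" in fields.get("041", ""):
        return True                          # language codes include Chinese
    if record.get("has_cjk_diacritics"):
        return True                          # CJK diacritics present
    if "Chinese" in fields.get("500", "") or "Chinese" in fields.get("546", ""):
        return True                          # notes mention Chinese
    return False
```

Records passing this broad cut then went on to the second, IsWadeGiles-based convertibility phase described below.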
The second phase of selection was to run a sort of IsWadeGiles test on a list of subfields in the record. If a subfield scrutinized in the record emerged from this testing with a status that indicated convertible text, the record was included in the set of those to convert.
Once the selection phase was complete, the conversion itself was run on the set of records selected. This conversion resembled the Chinese bibliographic records conversion, but was elaborated in several ways.
Tailoring the Conversion Software for Non-Chinese
At this point, the WorldCat conversion of Chinese bibliographic records had been run, and inspections of each conversion set led to the belief that the software was sound. So the specification of non-Chinese based itself largely on the same software, with somewhat fewer fields converted. The intended design was to deviate from use of the Chinese software only as found to be necessary.
By this time it became apparent that the presence of Korean or Japanese text in the selected set was a distinct threat. Yet Japanese and Korean are frequently mixed with Chinese in the same record, the same field, even the same subfield.
From LC two lists of Japanese and Korean romanized syllables that matched WG syllables were obtained. With these dictionaries added to the suite, it was possible, for records coded Japanese/Korean, to check the dictionaries to see if all the WG syllables found were also in the Japanese or Korean list. This would alert the IsWadeGiles software to a higher level of risk of misconversion of Japanese/Korean to Pinyin.
Considering the length of the lists, it did not seem useful to search these dictionaries unless Japanese or Korean was indicated by the language coding; too much legitimate WG could be excluded, especially in short subfields. This pointed to the need to think about discrimination patterns.
The team worked with LC and staff at OCLC to develop lists of characteristic patterns of romanized letters that would occur in Japanese or in Korean, but would never occur in WG Chinese. Some of these patterns (initial letter b, end letter m) proved to be too “noisy” in terms of overlap with English terms or place name terms in the phrase dictionaries.
Some, however, proved more effective in discrimination. The diacritics patterns macron-o or macron-u or breve-o or breve-u could be searched for, generally enabling a subfield to be eliminated from further consideration. The longer pattern matches of letters served as further discriminators, such as the Japanese kyo, ryu, pyan, or terminal aa, ae, au, ea, ee, eo, eu.
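A minimal sketch of this pattern screening follows, with a small sample of the kinds of patterns just described (not the full lists developed with LC):

```python
import re

# Sketch of screening with "never occurs in WG" letter patterns.
# The patterns are a small illustrative sample of those described.
NEVER_WG_PATTERNS = [
    r"[ōūŏŭ]",                      # macron/breve vowels: J/K romanization
    r"\b(kyo|ryu)",                 # onsets that do not occur in WG
    r"\w(aa|ae|au|ea|ee|eo|eu)\b",  # terminal vowel pairs foreign to WG
]

def looks_japanese_or_korean(subfield):
    """True if any pattern fires, so the subfield can be dropped
    from further WG-to-Pinyin consideration."""
    return any(re.search(p, subfield, re.IGNORECASE)
               for p in NEVER_WG_PATTERNS)
```

The attraction of such patterns is that a single hit is decisive: unlike the syllable tallies, they eliminate a subfield without any dictionary lookup at all.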
Another issue was the general risk of converting personal names in non-Chinese records. There was a perceived risk of converting names that were not actually Chinese, or were too generic to evaluate reliably. It was decided that personal names fields would be converted only by use of the authority control software run by the OCLC Lacey Product Center, formerly known as OCLC/WLN, and that personal names fields would only be converted if the match was to a subfield a, plus either $c or $d. In cases where no authority control match was made, the subfield a was evaluated by the IsWadeGiles module, and if convertible text was found, the record was flagged in the 987$f to alert catalogers to this nonconversion.
IsWadeGiles Meets Non-Chinese
The processing of subfields in the IsWadeGiles module has a gatekeeper function. This is the software that scrutinizes a subfield, tokenizes it into individual words or syllables, and assesses matches against the dictionaries used in the actual conversion sequences. IsWadeGiles may decide a subfield should be skipped entirely, that it is safe to convert, or that it should not be converted but flagged for manual review and possibly broken down and reevaluated in Mixed Text.
Once IsWadeGiles has searched for place name (phrase) matches and has considered a list of stopwords, it searches the syllable dictionaries. It tallies the counts of unique WG, unique Pinyin, WG Same as Pinyin, and Common (could be either WG or Pinyin). If there are fewer matches than the count of tokens in the subfield these will be considered unknown—Other. At the end of this tally, rules for combinations are tested in order, until one fires, causing a decision to be made about the subfield, and exit from the evaluation program. If a subfield has only unique WG, and nothing else, go ahead and convert it. Most cases are more complex. The Same/Common rule illustrated previously is an example where generally the software would decide that the subfield should be flagged for manual review.
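The tally-and-rules shape of IsWadeGiles might be sketched as follows. The syllable sets and the rules themselves are drastically simplified illustrations of the approach, not the production rule base:

```python
# Sketch of the IsWadeGiles tally and ordered decision rules.
# Syllable sets are tiny illustrative samples; the real rule base
# was far larger, but the shape is the same: tally the four
# categories plus "Other", then test rules in order until one fires.
UNIQUE_WG = {"hsiao", "yen", "tzu", "ch'un", "ch'iu"}
UNIQUE_PY = {"xiao", "zi", "chun"}
SAME      = {"an", "min"}
COMMON    = {"pan", "kuo", "chang", "to"}

def is_wade_giles(tokens):
    wg    = sum(t in UNIQUE_WG for t in tokens)
    py    = sum(t in UNIQUE_PY for t in tokens)
    same  = sum(t in SAME for t in tokens)
    com   = sum(t in COMMON for t in tokens)
    other = len(tokens) - wg - py - same - com
    # Rules fire in order; the first match decides the subfield.
    if py:                return "skip"      # already Pinyin
    if wg and not other:  return "convert"   # unambiguously WG only
    if (same or com) and not wg and not other:
        return "review"   # could be WG or Pinyin: flag for a human
    if not wg:            return "skip"      # nothing safely WG found
    return "review"       # WG mixed with unknown text
```

The ordering matters: a unique-Pinyin hit short-circuits everything else, while the all-Same/Common case falls through to manual review exactly as described above.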
The rules are heuristic and so have a slight potential to make a conversion decision that a human reviewer would not. However, these rules had been developed iteratively over successions of authorities and bibliographic records tests reviewed. It was necessary to tailor them, with some care, to allow for special screening of non-Chinese in a high-risk environment. While the software is fairly conservative in its decisions, it was undesirable to have too much flagged for manual review, or skipped when it could have been converted.
One example was the evaluation of personal names fields not matched under authority control. The program needed to flag cases where WG clearly remained, but also try to avoid flagging the numerous cases of Chinese Westernized names (forms like C. C. Chen or Stephanie Chuang) that needed neither conversion nor human review. This was not hard to implement.
Then the team noticed that the variants implicit in bibliographic records conversion could lead the software to think that the subfield contained WG but also Other (and thus was possibly a Westernized name) in cases where the Other was a transposed letter or variant diacritic in a clear attempt to enter a WG personal name. This was a tricky situation; one cannot ignore the specific diacritics used and identify WG successfully. In this context normalization would be unworkable for an evaluative program. Moreover it would have been an enormous task to attempt any spelling correction. Known variants of the Chinese diacritic ayn were built into the dictionary loading, but there were only limited substitutions practical with the dictionaries, due to overlap.
Dictionary changes or variants were generally agreed to by all the consortium participants, with the exception of the IsWadeGiles stoplist used at OCLC. The stoplist attempted to identify common English descriptive phrases (such as “written by” or “published in”) that could be skipped, in order to convert more subfields. Over time, additions of common phrases to the stoplist generally improved the likelihood that Notes fields could be converted.
So how does one attempt to identify almost WG? There were no perfect answers, but one new approach was attempted. Little could be done about typos, but a search was added for diacritics as a rough discriminator of attempted WG entries from likely westernized names. It made the assumption that a westernized Chinese name probably would not feature diacritics.
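That rough discriminator might look something like this sketch, which simply tests for an aspiration mark or any combining diacritic:

```python
import unicodedata

# Sketch of the diacritic-based discriminator for personal names:
# an attempted WG entry usually carries an aspiration mark or other
# diacritic, while a Westernized name such as "C. C. Chen" does not.
def probably_attempted_wg(name):
    if "'" in name or "\u02BB" in name:   # apostrophe or ayn-like mark
        return True
    # After NFD decomposition, any combining mark signals a diacritic.
    return any(unicodedata.combining(c)
               for c in unicodedata.normalize("NFD", name))
```

As the text notes, this is only a heuristic: it cannot catch typos, but it cheaply separates diacritic-bearing attempted WG from plain-ASCII Westernized names.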
If this was a program report for funded research, there would be a section on future research. For the development team, there were practical limits to time that could be spent on the project. Some of the areas that would be interesting to explore in future software implementation are as follows:
- Addition of a sort of primitive learning component. Given the sets of inputs (for example, ten thousand-plus bibliographic records at a time), the system should infer patterns in the data for improved processing of future sets. For instance, seeing repeated strings of text that were categorized as Other causing a subfield to be rejected for review would suggest (given tests) additions to the stoplist to be used in future runs.
- Patterns of rules firing or not. Observing and categorizing these could suggest to the system either proposals for rule simplifications or that pruning the rules was needed.
- External analysis of the rules structure. Over the course of the project, these expert-derived rules became quite complex. A meta-analysis, perhaps using existing software from other sources, could lead to rules simplification that would make maintenance of the system simpler.
- Harnessing the implicit network of subfield modules. Here there is a sea of modules waiting, in a sense, for a chance to fire. In a few cases, the domain experts recognized a need for special treatment of situations arising in specific subfield modules, but the architecture of the software made it more tractable to devise rules abstracted enough to cover all subfield modules. It might be interesting to observe the firing of these rules specifically. The goal would be to see if patterns and filters would naturally evolve. Groupings of types such as notes subfields, controlled access subfields, or publication date, might generate data and reports. These could be used to make the reason for the rules—now hidden in advice from the experts—more visible in the code.
- Extending the schedule to allow for testing alternative, looser criteria for selection of non-Chinese records. This might imply the need for development of other dictionaries, for instance, syllables in French or in Russian mimicking WG.
All the development staff at OCLC felt they learned from this project in the sense of extending programming skills. The team compiled some insights about the nature of such a project as this. For example:
- Don’t generalize from two examples. Or one.
- Choose battles carefully. (If resources are scarce, don’t write software for data that doesn’t exist.)
- When necessary, pretend to have a spec. Write software to match it.
- Rapid prototyping can help to flesh out a sketchy spec. The tradeoff is unrapid support.
- Dictionaries have bad days, too.
- Champions—for such a consortial effort to succeed, at least one person in each organization must be determined to make it happen, whatever it takes.
- Rules—conjectures and refutations as World View.
- Allow time in the project schedule for review, re-review, and coordination of new versions of the spec against existing software. The team could have used one full-time employee to cover this.
- Test sets—domain experts can devise very good test sets. Expect gaps anyway.
- Version control—don’t write a line without it.
- Archive every e-mail. Make telephone records on every call. Make other people’s heads hurt by being able to find them later.
- Manual review by human experts is an essential part of the team solution.
- Rules for deciding whether to convert—tough to develop anything like a “covering set.” Possible to develop some critical gatekeepers.
- Authorities conversion to bibliographic records conversion: Expect more software revision than originally planned.
- Evolving specs—not all parts of the spec will attain equal robustness. Expect to do triage. (For instance software redundant checks, flagging riskier conversions.)
- Granularity—many small repetitive modules, one per subfield. Sounds like bad design; proved essential to the many changes over the lifetime of the project.
- The status of a subfield seems to suffer from multiple personality disorder.
- Promise your superiors you’ll never agree to a project like this again. Cross your fingers behind your back.
- Moving targets—never believe all parties are working to the same spec.
Any effort on the scale of this project requires the help of a host of professionals behind the scenes. This conversion could not have been accomplished without the patient, tireless efforts of Philip Melzer and his colleagues at LC. Thanks are due also to the OCLC CJK Users Group Pinyin Task Force for their comments on test conversions, and to OCLC staff too numerous to mention, for all their painstaking advice and assistance.
References and Notes
1. Yewang Wang, “A Look into Chinese Persons’ Names in Bibliography Practice,” Cataloging and Classification Quarterly 31, no. 1 (2000): 51–81.
2. Rick Bennett, “Bringing Chinese Cataloging Records Up-to-Date,” OCLC Newsletter (Mar./Apr. 2000): 18.
3. Library of Congress, “Other U.S. Libraries Join International Community on Use of Pinyin,” accessed July 9, 2002, http://www.loc.gov/today/pr/2000/00-141.html; Library of Congress, “New Chinese Romanization Guidelines,” accessed July 9, 2002, http://lcweb.loc.gov/catdir/pinyin/romcover.html.
4. Yewang Wang, “A Look into Chinese Persons’ Names in Bibliography Practice.” Wang gives a detailed discussion of the many problems in the cataloging treatment of Chinese personal names.