Consider the following table, generated from the 10M-character ROCLING corpus, a different corpus from the Academia Sinica corpus used by Huang and colleagues. As in Huang et al.'s table, it shows, for each character of each two-character county name, the mutual information between that character and the character 縣 `county'.
The table was produced by a script that eliminates from consideration any examples where one of the two characters in a county name occurs directly before the character 縣. This has the desirable effect of eliminating the actual suoxie examples, as Huang et al. did; but it is likely to err somewhat on the side of eliminating too many examples. Is the resulting table a reasonable rendition of what's in the Huang et al. papers, and if not, what are the important differences?
County | First character | MI with 縣 | Second character | MI with 縣 |
台 東 | 台 | 0.285 | 東 | 1.402 |
台 北 | 台 | 0.285 | 北 | 1.173 |
花 蓮 | 花 | 0.623 | 蓮 | 1.242 |
彰 化 | 彰 | 1.483 | 化 | 1.129 |
苗 栗 | 苗 | 2.307 | 栗 | 2.985 |
台 中 | 台 | 0.285 | 中 | 1.004 |
澎 湖 | 澎 | 1.490 | 湖 | 0.991 |
雲 林 | 雲 | 1.718 | 林 | 1.608 |
台 南 | 台 | 0.285 | 南 | 1.573 |
桃 園 | 桃 | 0.986 | 園 | 0.911 |
高 雄 | 高 | 0.464 | 雄 | 1.018 |
南 投 | 南 | 1.573 | 投 | -0.127 |
屏 東 | 屏 | 3.646 | 東 | 1.402 |
嘉 義 | 嘉 | 2.194 | 義 | 0.926 |
新 竹 | 新 | 0.405 | 竹 | 0.765 |
宜 蘭 | 宜 | 0.673 | 蘭 | 1.469 |
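The filtering and scoring described above could be sketched roughly as follows. This is not the script that produced the table; it is a minimal illustration that assumes the pointwise form of mutual information, log [P(x, y) / (P(x) P(y))], with probabilities estimated from line-level counts. The function name pmi_with_target, the use of a line as the co-occurrence unit, the base-2 logarithm, and the decision to drop every line in which the character appears immediately before 縣 are all assumptions made for the example, so its values should not be expected to match the figures in the table.

```python
import math

def pmi_with_target(lines, char, target="縣", base=2):
    """Pointwise mutual information between `char` and `target`.

    Assumptions (hypothetical, not taken from the original script):
    - a line of text is the unit of co-occurrence;
    - any line containing `char` immediately before `target` is dropped
      entirely, which removes the suoxie cases but may also remove
      other evidence from that line.
    """
    n = n_char = n_target = n_both = 0
    for line in lines:
        if char + target in line:
            continue  # eliminated from consideration
        has_char = char in line
        has_target = target in line
        n += 1
        n_char += has_char
        n_target += has_target
        n_both += has_char and has_target
    if n_both == 0:
        return None  # PMI is undefined without any co-occurrence
    # P(x, y) / (P(x) P(y)) reduces to n_both * n / (n_char * n_target)
    return math.log(n_both * n / (n_char * n_target), base)
```

Calling pmi_with_target(lines, "東") on an iterable of corpus lines returns a single score for 東; a row of the table would need one call per character of the county name. Dropping whole lines is a deliberately coarse filter, in keeping with the remark that the original script probably eliminates too many examples.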
References
Huang, Chu-Ren, Kathleen Ahrens, and Keh-Jiann Chen. 1994. A data-driven approach to psychological reality of the mental lexicon: Two studies on Chinese corpus linguistics. In Language and its Psychobiological Bases, Taipei.
Huang, Chu-Ren, Wei-Mei Hong, and Keh-Jiann Chen. 1994. Suoxie: An information based lexical rule of abbreviation. In Proceedings of the Second Pacific Asia Conference on Formal and Computational Linguistics II, pages 49-52, Japan.