Sproat, COLING 2002 Tutorial, Association Measures

Consider four association measures: mutual information, weighted mutual information, chi-square, and Dunning's likelihood ratio.
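
As a rough reference, the four measures can be computed from the same three counts plus a corpus size. The sketch below is not taken from the tutorial: the function names, the parameter n (total number of character-pair positions), and in particular the definition of weighted mutual information as mutual information multiplied by the pair's relative frequency are assumptions, and the tutorial's own definitions may differ in detail.

```python
import math

def contingency(f12, f1, f2, n):
    """Observed 2x2 counts for the pair c1c2, given f(c1c2), f(c1), f(c2), n."""
    o11 = f12                # c1 followed by c2
    o12 = f1 - f12           # c1 followed by some other character
    o21 = f2 - f12           # some other character followed by c2
    o22 = n - f1 - f2 + f12  # neither c1 first nor c2 second
    return o11, o12, o21, o22

def mutual_information(f12, f1, f2, n):
    """Pointwise mutual information in bits: log2 of P(c1c2) / (P(c1) P(c2))."""
    return math.log2(f12 * n / (f1 * f2))

def weighted_mutual_information(f12, f1, f2, n):
    """ASSUMED definition: mutual information weighted by P(c1c2)."""
    return (f12 / n) * mutual_information(f12, f1, f2, n)

def chi_square(f12, f1, f2, n):
    """Pearson chi-square statistic for the 2x2 table (no continuity correction)."""
    o11, o12, o21, o22 = contingency(f12, f1, f2, n)
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

def log_likelihood(f12, f1, f2, n):
    """Log-likelihood ratio statistic G2 = 2 * sum(O * ln(O / E)) over the table."""
    o11, o12, o21, o22 = contingency(f12, f1, f2, n)
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n  # expected count under independence
            if o > 0:                  # a zero cell contributes nothing
                total += o * math.log(o / e)
    return 2.0 * total
```

All four functions return 0 when the pair occurs exactly as often as independence predicts, that is, when f(c1c2) = f(c1) f(c2) / n. The sketch does not guard against degenerate inputs such as a zero count or a character that fills the whole corpus.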

For each measure, let's rate its efficacy at extracting reasonable two-character words from the ROCLING corpus (10 million characters). Below you'll find links to a list of the 500 most highly associated terms according to the measure in question, with the only restriction being that the term must occur at least five times. (This restriction mitigates the tendency of association measures like mutual information to overrate pairs with very low counts.) Each line lists the association score, f(c1c2), f(c1), f(c2), and c1c2. Examples that don't seem like words are starred in the first position. (Thanks to Chilin Shih for providing the judgments.) You may of course wish to decide for yourself which pairs constitute words.
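
One way to compare the lists is to tally the starred entries in each. The sketch below assumes that the fields on each line are separated by whitespace and that the star, when present, is the first non-blank character; the actual files may be laid out differently. The names parse_line and acceptance_rate are hypothetical, and the sample lines are invented for illustration, not drawn from the ROCLING lists.

```python
def parse_line(line):
    """Split one list line into (starred, score, f12, f1, f2, pair).

    ASSUMED layout: optional leading '*', then score, f(c1c2), f(c1),
    f(c2), and the pair c1c2, separated by whitespace.
    """
    text = line.strip()
    starred = text.startswith("*")
    fields = text.lstrip("*").split()
    score = float(fields[0])
    f12, f1, f2 = int(fields[1]), int(fields[2]), int(fields[3])
    pair = fields[4]
    return starred, score, f12, f1, f2, pair

def acceptance_rate(lines):
    """Fraction of non-blank lines that are not starred (judged to be words)."""
    parsed = [parse_line(line) for line in lines if line.strip()]
    accepted = sum(1 for entry in parsed if not entry[0])
    return accepted / len(parsed)

# Invented sample lines, for illustration only.
sample = [
    "12.5 40 55 60 XY",
    "*11.0 7 300 90 PQ",
    "9.8 15 20 33 RS",
    "* 9.1 5 800 700 TU",
]
print(acceptance_rate(sample))  # prints 0.5
```

Running acceptance_rate over each of the four lists gives one number per measure, which is a crude but direct way to compare them.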

If you were a lexicographer and wanted to find 500 good terms without having to edit out too many non-terms, which measure would you choose?

Do you notice any other interesting features of the highly associated character pairs under some of the measures of association?