Sproat: Non-linguistic corpora

This collection contains the corpora and some of the tools cited in:

Sproat, R. (2014). "A Statistical Comparison of Written Language and Nonlinguistic Symbol Systems". Language, vol. 90(2), 457-481, June 2014.

For a description of some of the data, see also:

Katherine Wu, Jennifer Solman, Ruth Linehan and Richard Sproat. "Corpora of Non-Linguistic Symbol Systems." Linguistic Society of America, Portland, OR, January 2012.

The file corpora.zip contains all the corpora in XML format, along with the XML schema.
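
As a first look at the data, the XML files can be inspected straight from the archive without unpacking it. The following is a minimal sketch, not part of the distributed tools: the function name summarize_xml_members is hypothetical, it assumes only that the corpus files end in .xml, and it makes no assumption about the element names defined by the schema. It simply counts how often each element tag occurs in each file.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml_members(archive):
    """Return {member name: Counter of element tags} for every .xml file in a zip.

    `archive` may be a path (e.g. "corpora.zip") or a binary file-like object.
    """
    summary = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Assumption: corpus files use the .xml extension; anything else
            # (for example a schema stored under another extension) is skipped.
            if not name.lower().endswith(".xml"):
                continue
            with zf.open(name) as fh:
                root = ET.parse(fh).getroot()
            # root.iter() walks the root element and all of its descendants.
            summary[name] = Counter(el.tag for el in root.iter())
    return summary


if __name__ == "__main__":
    for name, tags in summarize_xml_members("corpora.zip").items():
        print(name, dict(tags))
```

The tag counts give a quick picture of how each corpus is structured before consulting the schema or running xtract.py.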

tools.zip contains a simple Python program, xtract.py, that extracts data from the XML format in various ways. It also contains ngram-entropy, which computes entropies from the output of an n-gram model built with the OpenGrm NGram library. See the README for a typical use case.
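
For readers unfamiliar with the kind of quantity involved, the sketch below shows one common way to turn an n-gram model's per-symbol log-probabilities into an entropy estimate in bits per symbol. It is an illustration only: it is not the ngram-entropy program, it does not call the OpenGrm NGram library, and the function name bits_per_symbol and its input format (a plain list of log-probabilities) are assumptions made for the example. The exact quantity reported by ngram-entropy is documented in the README, not here.

```python
import math


def bits_per_symbol(log_probs, base=math.e):
    """Average negative log2-probability of a symbol sequence.

    `log_probs` holds one log-probability per symbol, as assigned by some
    n-gram model (hypothetical input format); `base` is the base of those
    logarithms (e by default, 10 for log10 values).
    """
    if not log_probs:
        raise ValueError("need at least one log-probability")
    # Convert each log_base(p) to log2(p) via log2(p) = log_base(p) * ln(base) / ln(2).
    to_bits = math.log(base) / math.log(2)
    return -sum(log_probs) * to_bits / len(log_probs)


if __name__ == "__main__":
    # Four symbols, each assigned probability 0.25 by the model.
    print(bits_per_symbol([math.log(0.25)] * 4))
```

With four symbols each assigned probability 0.25, the estimate is approximately 2 bits per symbol, as expected for a uniform choice among four outcomes.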

Errata:

References to the "Wilcoxon signed rank test" in the main paper should be replaced with "Wilcoxon rank sum test".
On page 476, the second bullet item from the bottom of the bullet list refers to both "Luwian" and "Hittite hieroglyphs". Hittite hieroglyphs are in fact Luwian, so the reference to "Hittite hieroglyphs" should have been deleted.