This distribution contains the corpora and some of the tools cited in:

Sproat, R. (2014). "A Statistical Comparison of Written Language and Nonlinguistic Symbol Systems". Language, vol. 90(2), 457-481, June 2014.

For a description of some of the data, see also:

Katherine Wu, Jennifer Solman, Ruth Linehan and Richard Sproat. "Corpora of Non-Linguistic Symbol Systems." Linguistic Society of America, Portland, OR, January 2012.

The file corpora.zip contains all the corpora in XML format, along with the XML schema.

The file tools.zip contains xtract.py, a simple Python program that extracts data from the XML format in various ways. It also contains ngram-entropy, which computes entropies from the output of an n-gram model built with the OpenGrm NGram library. See the README for a typical use case.
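
For readers unfamiliar with the kind of quantity involved, the following is a minimal sketch of unigram entropy and bigram conditional entropy computed directly from raw counts. It is not the ngram-entropy tool and does not reproduce its method: that tool works on the output of a model built with OpenGrm NGram, whereas this sketch assumes unsmoothed maximum-likelihood estimates over a plain Python list of symbols. The function names and the toy sequence are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def unigram_entropy(symbols):
    """Shannon entropy H(X), in bits, of the empirical symbol distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bigram_conditional_entropy(symbols):
    """Conditional entropy H(Y|X), in bits, of a symbol given its predecessor.

    Assumption: probabilities are maximum-likelihood estimates from raw
    bigram counts, with no smoothing or backoff.
    """
    bigrams = Counter(zip(symbols, symbols[1:]))
    contexts = Counter(symbols[:-1])  # how often each symbol starts a bigram
    total = sum(bigrams.values())
    h = 0.0
    for (x, _y), c in bigrams.items():
        p_xy = c / total              # joint probability P(x, y)
        p_y_given_x = c / contexts[x]  # conditional probability P(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# Hypothetical toy sequence: strictly alternating symbols.
toy = list("abababab")
print(unigram_entropy(toy), bigram_conditional_entropy(toy))
```

On a strictly alternating sequence like the toy one, each symbol fully determines the next, so the conditional entropy is zero even though the unigram entropy is one bit. Real corpora are far sparser than this, which is one reason a smoothed n-gram model is preferable to raw counts in practice.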

Errata:


© 2014 and onwards, Richard Sproat