Sproat, R. (2014). "A Statistical Comparison of Written Language and Nonlinguistic Symbol Systems". Language, vol. 90(2), 457-481, June 2014.
For a description of some of the data, see also:
Katherine Wu, Jennifer Solman, Ruth Linehan and Richard Sproat. "Corpora of Non-Linguistic Symbol Systems." Linguistic Society of America, Portland, OR, January 2012.
The file corpora.zip contains all the corpora in XML format, along with the XML schema.
tools.zip contains a simple Python program, xtract.py, that extracts data from the XML format in various ways. It also contains ngram-entropy, which computes entropies from the output of an n-gram model built with the OpenGrm NGram library. See the README for a typical use case.
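For readers unfamiliar with the kind of quantity involved, the following is a minimal sketch of a bigram conditional entropy computed from raw counts. It is not the ngram-entropy tool and does not use the OpenGrm library or the XML corpora: the function name bigram_conditional_entropy and the toy input are assumptions made for illustration, and the estimate is unsmoothed maximum likelihood, which may differ from what the distributed tools compute.

```python
import math
from collections import Counter


def bigram_conditional_entropy(sequences):
    """Return H(X_n | X_{n-1}) in bits from unsmoothed bigram counts.

    `sequences` is an iterable of symbol sequences (lists or strings).
    Hypothetical helper for illustration only; not part of tools.zip.
    """
    pair_counts = Counter()     # counts of (previous, current) symbol pairs
    context_counts = Counter()  # counts of each symbol as a bigram context

    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0  # no bigrams observed

    entropy = 0.0
    for (prev, _cur), count in pair_counts.items():
        p_joint = count / total                # P(prev, cur)
        p_cond = count / context_counts[prev]  # P(cur | prev)
        entropy -= p_joint * math.log2(p_cond)
    return entropy


if __name__ == "__main__":
    # Toy symbol sequence; real input would come from the extracted corpora.
    print(bigram_conditional_entropy([["a", "a", "b", "a", "a", "b"]]))
```

A strictly alternating sequence such as a b a b yields 0 bits under this estimate, since each symbol fully determines the next; less predictable sequences yield higher values.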
Errata:
© 2014 and onwards, Richard Sproat