This page is under construction.

Approximate String Matches in the rongorongo Corpus.



1. Synopsis

It has been known since the 1940s that several of the extant Easter Island rongorongo tablets carry parallel texts, and over the years further partial parallels have been found (Barthel, 1958; Guy, 1985; Fischer, 1997; rongorongo, 2000).

The present project aims to discover partial matches in the corpus of extant tablets using approximate string matching techniques. The basic method is to compute a suffix array (Manber and Myers, 1993) over the entire corpus. Because the suffixes are sorted, this has the effect of grouping together all suffixes in the corpus that start with the same glyph. Within each such group we compute approximate string matches using the algorithm described in the introduction to Sankoff and Kruskal (1983). As a further constraint we try to match only strings of certain base lengths (in this incarnation 5, 10, 15, ..., 120, 125, 130), and we insist upon a maximum mismatch k of 20% of the base length. We also insist that the final glyphs of the two strings match. Thus a string of length 10 might match a string of length 9 that is at an edit distance of two (one substitution and one deletion) from it, as long as the two strings begin and end on the same glyph.

The motivation for insisting on matched strings starting on the same glyph and ending on the same glyph is to reduce the amount of search and the number of returned "duplicate" matches. Clearly if we have two strings s1, s2 with lengths m = |s1| and n = |s2|, respectively, and if at least (1-k)*m glyphs must match, with (1-k)*m >= 2, then there must be substrings s'1 of s1 and s'2 of s2 such that s'1[0] = s'2[0] and s'1[m'-1] = s'2[n'-1], where m' = |s'1| and n' = |s'2|: simply take the substrings that run from the first matched glyph to the last. Then, even if the "true" match is between s1 and s2, we will be able to find it by inspecting the contexts of the found matches s'1 and s'2.
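To make the argument concrete, the sketch below trims two strings to the span between their first and last matched glyphs. It uses the matching blocks reported by Python's `difflib.SequenceMatcher` as a convenient stand-in for an alignment (these are not guaranteed to come from a minimum-edit alignment), and the function name and sample strings are made up for the illustration.

```python
from difflib import SequenceMatcher


def trim_to_matching_ends(s1, s2):
    """Return (s'1, s'2): s1 and s2 cut down to run from their first
    matched glyph through their last matched glyph, or None if nothing matches."""
    blocks = SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
    real = [b for b in blocks if b.size > 0]   # drop the zero-length terminator
    if not real:
        return None
    first, last = real[0], real[-1]
    end1 = last.a + last.size                  # one past the last matched glyph in s1
    end2 = last.b + last.size                  # one past the last matched glyph in s2
    return s1[first.a:end1], s2[first.b:end2]


s1 = [7, 1, 2, 3, 4, 5, 8]
s2 = [9, 1, 2, 6, 4, 5]
t1, t2 = trim_to_matching_ends(s1, s2)
print(t1, t2)   # [1, 2, 3, 4, 5] [1, 2, 6, 4, 5]
```

The trimmed strings begin and end on the same glyph, so a search restricted to such strings still lands inside the longer match, whose full extent can then be read off from the context.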

Note that the longest approximate match we found under these conditions was 125 glyphs long, between recto line 2, glyph 36 of the Great Santiago and recto line 2, glyph 0 of the Small St. Petersburg.

The data for the corpus and the images were retrieved from the excellent rongorongo website. The matches were computed over a "reduced" version of the Barthel set, which is essentially the Barthel set with the various diacritics removed. Thus a string like:


would be represented as:


This of course makes the implicit assumption that the various forms of the glyph included by Barthel under the same basic numerical code are in fact just variants of the same glyph rather than separate glyphs.
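A reduction of this kind might be sketched as below. The token format is an assumption: each transcription token is taken to be a numeric Barthel code optionally followed by modifier characters, and tokens with no numeric part are dropped. The sample tokens are invented for the illustration and the function names are hypothetical.

```python
import re

# Assumed token shape: leading digits (the basic code), then any modifiers.
_BASE_CODE = re.compile(r"\d+")


def reduce_token(token):
    """Return the basic numeric code of a token, or None if it has none."""
    m = _BASE_CODE.match(token)
    return int(m.group()) if m else None


def reduce_line(tokens):
    """Reduce a line of tokens to basic codes, dropping tokens without one."""
    return [c for c in map(reduce_token, tokens) if c is not None]


# Invented tokens, for illustration only.
print(reduce_line(["200a", "076f", "010", "?"]))   # [200, 76, 10]
```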

2. Some Results

  1. See here for a listing of matches ordered by tablet (about 197K: be patient, this may take a while for your browser to display).
  2. See here for a listing of matches ordered by match length (about 197K: be patient, this may take a while for your browser to display).
  3. See here (or here for a PDF version) for a plot that gives a synopsis of the matches for the entire corpus. The key to the tablet abbreviations can be found here. In the plot the red lines indicate tablet divisions and the turquoise lines indicate line divisions within the tablet (with the ordering assumed by Barthel, rather than Fischer, which differ for some tablets). The black points represent matches, with an approximate match of, say, ten glyphs being represented by a line composed of ten dots. The tablet names are indicated on the horizontal and vertical axes, though names of the shorter tablets are unfortunately occulted.
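The geometry of such a plot can be sketched without any plotting library. The code below assumes the corpus is held as a mapping from tablet name to a list of glyph codes, concatenated in order along each axis; it computes where each tablet starts (the positions of the tablet division lines) and the dots for a single match. It draws an approximate match as a straight diagonal and so ignores any length difference between the two matched strings. All names and the toy data are hypothetical.

```python
from itertools import accumulate


def tablet_offsets(corpus):
    """Global start position of each tablet when tablets are concatenated in order."""
    names = list(corpus)
    starts = [0] + list(accumulate(len(corpus[n]) for n in names))
    return dict(zip(names, starts))   # zip stops at names, dropping the final total


def match_points(x_start, y_start, length):
    """One dot per glyph of a match, running along a diagonal."""
    return [(x_start + t, y_start + t) for t in range(length)]


toy = {"A": [1, 2, 3], "B": [4, 5]}
print(tablet_offsets(toy))        # {'A': 0, 'B': 3}
print(match_points(3, 0, 3))      # [(3, 0), (4, 1), (5, 2)]
```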

The plot immediately reveals the long shared portions of the Great Santiago and the Great and Small St. Petersburgs, discussed elsewhere, as well as the parallels between the Small Santiago and the London tablet. Other shorter matches between various tablets are also revealed. Also striking is the fact that the Santiago Staff seems to be an isolate, matching with almost nothing else except itself. The reason for this is presumably the abundance in this text of the "phallus" glyph (Barthel 76), 83% of the tokens of which occur in the Staff, and of the vertical separator (coded as 999), which occurs nowhere else. The "phallus" glyph led to Fischer's claim (1995) that the text in the Santiago Staff is a procreation chant with repeated formulae of the form X ki `ai ki roto `o Y: ka pu te Z `X copulated with Y: there issued forth Z'. He has since claimed that other texts are also procreation chants, albeit sans phallus: see, e.g., Fischer (1997, page 444), where he claims that he "could demonstrate that isolated segments on [the Small Santiago, verso] were procreation chants". If other texts were like the Santiago Staff, one might expect to see more approximate matches. Fischer has an "explanation" for this: he assumes that in many of the other texts, the "phallus" was simply omitted. Of course, with sufficient assumptions about what may be present, any string can match with any other string, so it's not clear how one would falsify Fischer's claim in the absence of independent evidence. One is inclined to agree with Guy's assessment:

Fischer's lack of method does not stop there. In another article, published in the Rapa Nui Journal, he claims to have identified similar copulation stories on "eleven other tablets, all of them lacking the phallic suffix". In other words, wherever he did not see a phallus, he supplied one.

As an attempt at a test for Fischer's "phallus omission" assumption, we computed the same string matches for a version of the corpus where glyph 76, the phallus symbol, had been removed. Presumably if many parts of the other tablets are really texts that are like the Santiago Staff, albeit sans explicit phallus, one ought to increase one's chance of finding matches between the Staff and other tablets by removing the offending member. The results (PDF version) were the same as for the unadulterated version of the corpus: the Santiago Staff still appears as an isolate.
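Preparing the modified corpus for this test amounts to a simple filter, sketched here under the assumption that the corpus is a mapping from tablet name to a list of reduced glyph codes. The function names and the toy corpus are hypothetical, and the toy counts bear no relation to the real figures.

```python
def glyph_share(corpus, glyph, tablet):
    """Fraction of all tokens of `glyph` in the corpus that occur in `tablet`."""
    total = sum(codes.count(glyph) for codes in corpus.values())
    return corpus[tablet].count(glyph) / total if total else 0.0


def remove_glyph(corpus, glyph=76):
    """Copy of the corpus with every token of `glyph` deleted."""
    return {name: [g for g in codes if g != glyph]
            for name, codes in corpus.items()}


# Toy data only.
toy = {"Staff": [76, 1, 76, 2, 76, 999], "Other": [1, 76, 2]}
print(glyph_share(toy, 76, "Staff"))   # 0.75
print(remove_glyph(toy))               # {'Staff': [1, 2, 999], 'Other': [1, 2]}
```

The matching procedure is then rerun unchanged on the filtered corpus.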

Note that the listings in comp1.html and comp2.html are not the complete set of matches in that we only keep the longest match between line n of tablet X and line m of tablet Y. In general those matches not being shown are merely subsets of the ones that are shown.
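The filtering described here can be sketched as keeping, for each pair of lines, only the longest match. The record layout below (tablet, line, start glyph and length) is an assumption made for the example, as are the names and the sample matches.

```python
from typing import NamedTuple


class Match(NamedTuple):
    tablet_x: str
    line_x: int
    start_x: int
    tablet_y: str
    line_y: int
    start_y: int
    length: int


def longest_per_line_pair(matches):
    """Keep only the longest match for each (tablet X, line n, tablet Y, line m)."""
    best = {}
    for m in matches:
        key = (m.tablet_x, m.line_x, m.tablet_y, m.line_y)
        if key not in best or m.length > best[key].length:
            best[key] = m
    return list(best.values())


# Invented matches, for illustration only.
found = [
    Match("X", 2, 36, "Y", 2, 0, 10),
    Match("X", 2, 36, "Y", 2, 0, 25),   # longer match on the same pair of lines
    Match("X", 3, 4, "Y", 2, 7, 5),
]
for m in longest_per_line_pair(found):
    print(m.tablet_x, m.line_x, m.tablet_y, m.line_y, m.length)
# X 2 Y 2 25
# X 3 Y 2 5
```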

(I am finding some cases where there were portions of the transcription missing, due to imperfect processing of the text retrieved from the website. I am in the process of fixing those errors.)

3. Partial List of References

  1. Barthel, Thomas. 1958. Grundlagen zur Entzifferung der Osterinselschrift. Abhandlungen aus dem Gebiet der Auslandskunde 64, Reihe B, vol. 36. Hamburg: Cram, de Gruyter & Co.
  2. Fischer, Steven Roger. 1995. "Preliminary Evidence for Cosmogonic Texts in Rapanui's Rongorongo Inscriptions." Journal of the Polynesian Society. 104:303-21.
  3. Fischer, Steven Roger. 1997. Rongorongo, The Easter Island Script: History, Traditions, Texts. Oxford University Press.
  4. Guy, Jacques. 1985. "On a fragment of the `Tahua' tablet." Journal of the Polynesian Society. 94:367-88.
  5. Manber, Udi and Myers, Gene. 1993. "Suffix arrays: a new method for on-line string searches." SIAM Journal on Computing. 22(5):935-948.
  6. rongorongo web site. 2000.
  7. Sankoff, David and Kruskal, Joseph. 1983. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley. Reissued 1999, CSLI Publications.

This page was last modified January 11, 2003.