The paper "Entropic Evidence for Linguistic Structure in the Indus Script" (April 23) by Rao and colleagues (henceforth Rao) proposes that bigram conditional entropy (which quantifies how predictable a symbol is given the symbol that precedes it) provides clear evidence that the Indus inscriptions were a writing system rather than a non-linguistic symbol system. The paper is an attempt to refute a 2004 paper by the three of us ("Collapse of the Indus Script Thesis" --- henceforth FSW). But Rao's rebuttal fails, for several reasons. First, though FSW included statistical arguments, our most important arguments against the traditional script idea were based on extensive archaeological evidence. Rao does not mention this evidence. Second, Rao's "representative examples" of non-linguistic symbol systems are artificial sets of 200,000 totally ordered signs (where entropy is minimal) and 200,000 totally randomized signs (where entropy is maximal). Despite Rao's claims to the contrary, neither corresponds to anything remotely resembling premodern non-linguistic sign systems. Since there are systems (European heraldry, mathematical systems, etc.) that are neither random nor rigidly ordered, at best one can say that Rao grossly undersampled the space. Their inclusion instead of data that have nothing to do with man-made symbols (DNA and protein sequences) is odd, to say the least. Third, no single statistical measure can decide the complex matter of whether a symbol system is linguistic or not. A given conditional entropy profile is consistent with many underlying models. And Zipfian distributions (discussed in Rao's supplement) hold for linguistic and many non-linguistic systems alike (the distribution of city sizes, for example). In summary, Rao has not seriously addressed any of the key issues raised by FSW. Certainly, simple statistical measures based on an impoverished and partly artificial set of comparisons are not evidence for anything.
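For readers who want to see the measure at work, here is a minimal sketch of a plug-in estimate of bigram conditional entropy, applied to three toy sequences. It is not Rao's estimation procedure, which is not reproduced here. The 8-sign alphabet, the sequence length, the function names and the "mixed" process (usually step to the next sign in a cycle, otherwise pick at random) are all invented for illustration and model no real symbol system.

```python
import math
import random
from collections import Counter


def bigram_conditional_entropy(seq):
    """Plug-in estimate, in bits, of H(next | prev) =
    -sum over (prev, next) of P(prev, next) * log2 P(next | prev)."""
    pairs = list(zip(seq, seq[1:]))
    if not pairs:
        raise ValueError("need at least two symbols")
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    total = len(pairs)
    entropy = 0.0
    for (prev, _), count in pair_counts.items():
        p_pair = count / total              # P(prev, next)
        p_cond = count / prev_counts[prev]  # P(next | prev)
        entropy -= p_pair * math.log2(p_cond)
    return entropy


def toy_sequences(n_tokens=20000, n_signs=8, follow_prob=0.7, seed=0):
    """Three invented sign sequences; none models a real symbol system."""
    rng = random.Random(seed)
    signs = list(range(n_signs))
    # Rigid order: the signs repeat in a fixed cycle.
    ordered = [signs[i % n_signs] for i in range(n_tokens)]
    # No sequential structure: each sign is drawn uniformly at random.
    shuffled = [rng.choice(signs) for _ in range(n_tokens)]
    # Hypothetical in-between process: usually step to the next sign in the
    # cycle, otherwise pick a sign uniformly at random.
    mixed = [rng.choice(signs)]
    while len(mixed) < n_tokens:
        if rng.random() < follow_prob:
            mixed.append((mixed[-1] + 1) % n_signs)
        else:
            mixed.append(rng.choice(signs))
    return {"ordered": ordered, "random": shuffled, "mixed": mixed}


if __name__ == "__main__":
    for name, seq in toy_sequences().items():
        print(f"{name:8s} {bigram_conditional_entropy(seq):.3f} bits")
```

On this toy data the rigidly ordered sequence scores zero and the random one close to log2(8) = 3 bits. The mixed process lands in between, although nothing about it is linguistic: an intermediate value by itself says little about what kind of system produced the sequence.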
Richard Sproat, Steve Farmer, Michael Witzel
----
Since this letter was written, a "response" from Rao et al. has appeared (http://www.cs.washington.edu/homes/rao/IndusResponse.pdf). For now we'll just draw attention to one bizarre claim made there:
"The artificial data sets in our work represent controls, necessary in any scientific investigation, which delineate the limits of what is possible. The two controls in our work represent sequences with maximum and minimum flexibility, for a given number of tokens. Though this can be computed analytically, the data sets were generated to subject them to the same parameter estimation process as the other data sets. Our conclusions do not depend on the controls, but are based on comparisons with real world data: DNA and protein sequences, various natural languages, and FORTRAN computer code. All our real world examples are bounded by the maximum and the minimum provided by the controls, which thus serve as a check on the computation."
The "artificial data sets" referred to here are, of course, the Types 1 and 2 "non-linguistic symbol systems" from the original paper. This is an odd switch of position, to say the least. In the paper that appeared in Science, readers were clearly told that Types 1 and 2 represented plausible kinds of non-linguistic symbol systems. We remind readers of what Rao et al. said in the original paper:
"Two major types of nonlinguistic systems are those that do not exhibit much sequential structure ('Type 1' systems) and those that follow rigid sequential order ('Type 2' systems). For example, the sequential order of signs in Vinča inscriptions appears to have been unimportant (4). On the other hand, the sequences of deity signs in Near Eastern inscriptions found on boundary stones (kudurrus) typically follow a rigid order that is thought to reflect the hierarchical ordering of the deities (5).
Linguistic systems tend to fall somewhere between these two extremes: the tokens of a language (such as characters or words) do not follow each other randomly nor are they juxtaposed in a rigid order."
So the original claim was that these two extremes are major types of non-linguistic systems. Now, instead, they are merely bounds on the conditional entropy measure. So, we assume, Rao et al. are retracting their claim that these represented actual non-linguistic symbol systems. If so, then we are even more puzzled as to why Figure 1 in their paper failed to include the one example in their data of a human-created non-linguistic symbol system --- Fortran. It appears in Figure 2, but why was the "growth curve" for Fortran not shown in Figure 1?
In any case, given that Rao et al. keep emphasizing how "scientific" their approach was (Rao was quoted saying as much to the press, and they talk here about how their method follows that of "any scientific investigation"), they presumably ought to have known that to make any valid claim about the classification of an unknown system, one needs to consider a wide range of examples of each type. Even if Types 1 and 2 WERE reasonable representatives of non-linguistic systems, where in their paper does one find the reasonable sampling of man-made non-linguistic systems that one would expect to see in a carefully controlled study that purports to demonstrate that the Indus symbols were a script? Where are the two populations --- a decent sample of linguistic signs, a decent sample of non-linguistic signs --- that would be required of ANY scientific study? Basic scientific rigor that all of us should have learned in high school tells us that in order to assert that unknown entity x belongs to population Y rather than population Z, you need to have a decent sample of both Y and Z, and then compute some plausible statistic that shows that the probability that x belongs to Y is higher than the probability that it belongs to Z.
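A minimal sketch of that requirement, under strong simplifying assumptions that are not taken from either paper: each system is summarized by a single scalar statistic, that statistic is modeled as normally distributed within each population, and the two populations are treated as equally likely beforehand. The function names are hypothetical, and any numbers passed in are for the caller to supply from real samples.

```python
import math
from statistics import mean, stdev


def gaussian_log_density(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood_ratio(x, sample_y, sample_z):
    """log p(x | Y) - log p(x | Z), with a normal model fitted to each sample.

    With equal prior probabilities, a positive value means x is more
    probably a member of Y than of Z; a negative value means the reverse.
    """
    for name, sample in (("Y", sample_y), ("Z", sample_z)):
        # A spread cannot be estimated from fewer than two observations.
        if len(sample) < 2:
            raise ValueError(f"population {name} needs at least two samples")
    mu_y, sd_y = mean(sample_y), stdev(sample_y)
    mu_z, sd_z = mean(sample_z), stdev(sample_z)
    if sd_y == 0 or sd_z == 0:
        raise ValueError("each population needs some variation in the statistic")
    return gaussian_log_density(x, mu_y, sd_y) - gaussian_log_density(x, mu_z, sd_z)
```

Calling `log_likelihood_ratio` with an empty or single-element `sample_z` raises an error rather than returning an answer: without a sample of Z there is nothing to compare x against.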
It ain't enough to just have a couple of samples of Y, few if any of Z, and then declare, hey, it looks like Y. If one designed an experiment like that in a high-school science class, one would get a failing grade on the assignment. It's depressing that Rao and colleagues do not seem to understand this. But then apparently neither do the reviewers and editors at Science, which is surely even more depressing. Thus Rao et al.'s impressive-seeming Figure 1 contains exactly three natural languages, the Indus symbols and, if we are to believe what they now claim about the purpose behind Type 1 and Type 2, NO NON-LINGUISTIC SYSTEMS WHATSOEVER.
We remind the reader again that our original 2004 paper considered the Indus symbols in comparison to many ancient and modern symbol systems --- linguistic and otherwise. More to the point, we did not attempt to establish anything on the basis of statistical measures alone. Rao et al.'s response linked above discusses some of the non-statistical arguments in our paper, but the points they raise have already been addressed. For example, the supposed 26-glyph Indus inscription has been dealt with: see http://www.safarmer.com/indus/longestinscription.htm
Indeed, the number of topics that keep resurfacing in this debate (though not necessarily in Rao's discussion) is impressive. In blogs, people keep raising the existence of the Dravidian language Brahui in the general region of the Indus as evidence of possible Dravidian populations in that region in ancient times; but few Indologists believe that the presence of Brahui there is due to anything other than a recent migration. The proposal that the Indus "script" was some sort of ideographic script keeps resurfacing, even though there has never been a single documented case of an ideographic script in the entire history of writing systems: any real writing system MUST encode phonetic information.
This is not because of some gerrymandered definition of writing concocted by linguists: it is simply because it is well-nigh impossible to build a full writing system in any other way. Charles Bliss, despite a lifetime of effort, was never able to construct a fully expressive writing system based on semantics alone; his system, Blissymbolics, found successful application only among people with severe cognitive impairments that limited any communication anyway. The rejoinder to that is often: what about Egyptian and Chinese? Aren't they ideographic? This merely underscores the ignorance of many of the people engaged in this debate.