Richard Sproat

Jangobun Tomb (장고분/長鼓墳), Wolgyedong (월계동/月渓洞), Gwangju (광주/光州), Korea, April 2024

	Richard Sproat
	리차드 스프로트
	リチャード・スプロート
	史伯樂 ರಿಚರ್ಡ್ ಸ್ಪ್ರೋಟ್

	Research Scientist
	Sakana.ai
	Toranomon Hills Business Tower 15F
	1-17-1 Toranomon, Minato-ku
	Tokyo 105-6415 Japan

	Click here for other stuff. *I have a new video course, Introduction to Writing Systems: How Writing Encodes Language, out from Springer.*

I am a computational linguist (which means that I have some things in common with grapefruit).

I am a Research Scientist at Sakana.ai in Tokyo.

Prior to that I was a Research Scientist at Google, first in New York, then in Tokyo.

From January 2009 through October 2012, I was a professor at the Center for Spoken Language Understanding at the Oregon Health and Science University.

Prior to going to OHSU, I was a professor in the departments of Linguistics and Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. I was also a full-time faculty member at the Beckman Institute. I still hold adjunct positions in Linguistics and ECE at UIUC.

Research Interests

At Google I have been mostly working on text normalization, where my former group has been developing various machine learning approaches to the problem of normalizing non-standard words in text. I have been particularly interested in the promise (and limitations) of approaches using recurrent neural nets. As of September 2019, I have moved to Google Tokyo, and am working on end-to-end speech understanding.

I continue to maintain some "side-bar interests" including computational models of the early evolution of writing, the statistical properties of non-linguistic symbol systems, and a collaboration on a translation of Wolfgang von Kempelen's Mechanismus der menschlichen Sprache, which was published in 2017.

Prior to coming to Google I was working on several projects, some of which are still current in the sense that my collaborators are still working on them.

An NSF-funded project on text-to-scene conversion: NSF: "RI: Medium: Collaborative Research: From Text to Pictures". In our piece of the project we are looking at using Amazon's Mechanical Turk to add semantic information to a frame-like ontology (e.g. parts of objects, how objects are typically used, frame information about verbs, etc.).
An NIH (R01) funded project on "Computational characterization of language use in autism spectrum disorder".
An NIH (K25) funded project on "NLP for Augmentative and Alternative Communication in Adults". The plan is to develop systems that can predict a set of plausible responses for an AAC-system user given the discourse context. In Spring of 2010, I taught (with Brian Roark) a Seminar on Speech and Language Processing for Augmentative and Alternative Communication.
An NSF-funded project on Corpora of Non-Linguistic Symbol Systems.
A CIA-funded project on text-normalization and genre adaptation for social media texts.
OpenGRM: I am involved in open-sourcing various tools for finite-state and other language processing, including some of the Google tools for grammar development.

Some Previous Research Interests

Multilingual spoken term detection: I was team leader of the 2008 Johns Hopkins Center for Language and Speech Processing workshop on this topic.
Language modeling for colloquial Arabic speech recognition.
Named entity detection and transliteration for multiple languages. Web page.
Prediction of prosody from text for affective speech synthesis. Web page.
Acoustic and pronunciation modeling for accented standard Chinese. (Follow-up on work done at WS'04.)
The relation of layout to phonological awareness in scripts of South Asia.
Automated Methods for Second-Language Fluency Assessment. Web page (password protected).
Interpretation of location descriptions for botanical/zoological specimen labels.
Other serious stuff.

I am very interested in writing systems; see some work I was doing on approximate string matching in the Easter Island rongorongo script. I also ran (with Jerry Packard) a reading group centered around Hannas' controversial thesis relating Asia's supposed technological creativity gap to the Chinese writing system.

Prior to Joining Academia

Before joining the faculty at UIUC I worked in the Information Systems and Analysis Research Department headed by Ken Church at AT&T Labs-Research, where I worked on Speech and Text Data Mining: extracting potentially useful information from large speech or text databases using a combination of speech/NLP technology and data mining techniques.

Before joining Ken's department I worked in the Human/Computer Interaction Research Department headed by Candy Kamm. My most recent project in that department was WordsEye, an automatic text-to-scene conversion system. The WordsEye technology is now being developed at Semantic Light, LLC. WordsEye is particularly good for creating surrealistic images that I can easily conceive of but that are well beyond my artistic ability to execute. And we were doing this 20 years before the "AI revolution". All of the following images were generated from text descriptions of the scene. Click on the images to see the text that generated the scene:

Prior to joining AT&T Labs in 1999 I worked on Text-to-Speech Synthesis at Bell Labs, Lucent Technologies. Among other things, I was responsible for the multilingual text processing module of the Bell Labs Multilingual TTS System.

More and sometimes less recent Stuff

I have a new video course, Introduction to Writing Systems: How Writing Encodes Language, out from Springer.

As of September 1, 2024, I am joining Sakana AI.

My new book on Symbols is out from Springer.

Alexander Gutkin's and my recent paper on defining logography.
A brief summary of the results of the recent Kaggle Text Normalization Challenge.

Jürgen Trouvain's page with a link to the PDF of our new bilingual edition of Kempelen's Mechanismus der menschlichen Sprache.

What the more fortunate of us might do if we get a tax rebate courtesy of the Koch Brothers.

A blog on the Biology & Chemistry Department at Liberty University.

An Op-Ed-style essay on the 2014 Association for Computational Linguistics Lifetime Achievement Award winner Robert Mercer, Trump, and why it's a pity the ACL does not have a Code of Ethics (and likely wouldn't apply it if it did).

A new piece "Defending Democracy in an Illiberal Age" by Shalom Lappin and myself.

A short essay about the recent US elections, mostly to get this off my chest.

A new paper on our ongoing work on applying Recurrent Neural Networks to the problem of text normalization for speech synthesis.

Slides for a presentation on the computational simulation of the early evolution of writing, presented at the second conference on Signs of Writing: The Cultural, Social, and Linguistic Contexts of the World’s First Writing Systems, Beijing, June 2015. Also my paper from the first conference in Chicago, November 2014.

A press release from the Linguistic Society of America on my new paper in Language that shows that previously published statistical claims about Indus Valley Symbols and Pictish Symbols were wrong.

My opinion on Wikipedia entries for living persons in my field (and beyond).

Interview on WNYC's New Tech City here. I was trying to explain why no graphical form of communication will ever replace writing. Also a piece that focuses on the discussion of Blissymbolics here.

I am starting a collection of data on Indus Weights here.

As of June 1, 2013, I am the new Editor in Chief of the ACM Transactions on Asian Language Information Processing.

See here for a new monograph on statistical analyses of a number of corpora of linguistic and non-linguistic symbol systems.

I was General Chair of InterSpeech 2012.

My copy of Wolfgang von Kempelen's book is now online here.

A response to the recent paper in Science by Quentin Atkinson, which has (predictably) received a lot of press. Science does it again, choosing to publish a nice-sounding story that does not stand up to even mildly serious scrutiny.

Update to this page with scans done by the OHSU library of the first 100 pages of my 1791 first edition of Wolfgang von Kempelen's Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine.

My response to Rao et al.'s reply to my Computational Linguistics piece (below).

A new paper on the reviewing practices of the general science journals: "Ancient symbols, computational linguistics, and the reviewing practices of the general science journals." Computational Linguistics, 36:3, 2010.

A shocking finding on the relation between literacy and the ratio of male/female births in the population.

A particularly stupid article in Abu Dhabi-based The National on the continuing saga of the statistical "evidence" for the Indus Script thesis prompted me to update this page.
Some photos of my recently acquired 1791 first edition of Wolfgang von Kempelen's Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine. This is the very first serious study of articulatory phonetics, and the first description of a mechanical speech synthesizer.

My two cents' worth on the continuing credulous reports on the Rao et al. work on the Indus symbols.

Some musings on the current state of computational linguistics.

Slides from Kevin Knight's and my tutorial at NAACL in Boulder.

A refutation of the supposed evidence from Rao and colleagues that the Indus Valley symbol system was a script. Also: a simple Python program that does a pretty good simulation of Rao et al.'s results, assuming a Zipfian distribution of characters for a 400-character vocabulary, and conditional independence of the characters. In other words, you get the same behavior as Rao shows for the Indus corpus even with a model that has no syntactic dependence between the glyphs whatsoever. See also Mark Liberman's entry on the Language Log, and Fernando Pereira's skewering of this paper, as well as Science. Here is the letter that we sent to Science, but which they refused to publish. Of course they cited lack of space. But it's very hard to see what kind of letter would be more worthy of space than one that points out a set of fatal flaws in a "peer-reviewed" publication that appeared in Science. Finally, here's a plot that demonstrates, using Rao et al.'s technique, that European heraldry is a linguistic symbol system. It also seems to show that Amharic (a Semitic language) is closely related to the Dravidian language Tamil.
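For readers who want to see the idea in miniature, the kind of simulation described above can be sketched roughly as follows. This is an illustrative reconstruction, not the program referred to in the text: the corpus length, the Zipf exponent of 1, the random seed, the function names, and the use of an unsmoothed plug-in estimate of bigram conditional entropy are all assumptions made for the sketch. Symbols are drawn independently from a Zipfian distribution over a 400-symbol vocabulary, and conditional entropy is then measured over the N most frequent symbols.

```python
import math
import random
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipfian weights: rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def sample_corpus(vocab_size=400, length=10000, exponent=1.0, seed=0):
    """Draw symbols independently, so neighbors carry no dependence."""
    rng = random.Random(seed)
    symbols = list(range(vocab_size))
    weights = zipf_weights(vocab_size, exponent)
    return rng.choices(symbols, weights=weights, k=length)


def conditional_entropy(corpus, top_n):
    """Plug-in estimate of H(next | current) in bits, over bigrams whose
    two symbols both belong to the top_n most frequent symbols."""
    keep = {sym for sym, _ in Counter(corpus).most_common(top_n)}
    bigrams = Counter(
        (a, b) for a, b in zip(corpus, corpus[1:]) if a in keep and b in keep
    )
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    firsts = Counter()
    for (a, _), count in bigrams.items():
        firsts[a] += count
    # H(Y|X) = -sum over (x, y) of p(x, y) * log2 p(y | x)
    return -sum(
        (count / total) * math.log2(count / firsts[a])
        for (a, _), count in bigrams.items()
    )


if __name__ == "__main__":
    corpus = sample_corpus()
    for n in (20, 50, 100, 200, 400):
        print(n, round(conditional_entropy(corpus, n), 3))
```

Plotting the printed values against N gives a conditional entropy curve for a model with no dependence between adjacent symbols, which can then be set beside curves computed the same way for other corpora. A smoothed estimator would change the exact numbers.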

General talk about computational linguistics at ACM Reflections/Projections 2008.

2008 Johns Hopkins CLSP Summer Workshop on Multilingual Spoken Term Detection. Final slides.

Talk on evolutionary modeling of morphology in UIUC Linguistics Seminar, May 1, 2008. Same talk at QITL-3 (Helsinki), and the Max Planck Institute for Evolutionary Anthropology (Leipzig) here.

My recent visit to the Creation Museum in Kentucky.

Talk on the Phaistos Disk in the September 6, 2007, Linguistics Seminar at UIUC.

Co-organizer (with Steve Farmer) of a workshop on Scripts, Non-scripts and (Pseudo)-decipherment, July 11, 2007, to be held in conjunction with the 2007 LSA Summer Institute at Stanford University.

Guest lecture on WordsEye in LING 588, Spring 2007.

In Spring 2007 I am teaching a new, and I believe unique, course entitled Language, Technology and Society (LING270). The course covers language-related technology from the earliest writing systems all the way up to modern speech and language processing. It also explores the social implications of some of these technologies.

I was technical co-chair (with Yuji Matsumoto) of the 21st International Conference on Computer Processing of Oriental Languages, December 17-19, 2006.

I was co-chair (with Dan Roth) of the Third Midwest Computational Linguistics Colloquium.

Presentation at SALA 25.

Guidance for researchers contemplating doing joint research with colleagues in India.

Keynote address at Second Midwest Computational Linguistics Colloquium.

Slides for my talk for the April 22, 2005, Beckman Institute Director's seminar.

See Shalom Lappin's and my challenge to the Minimalist Program.

Slides from my tutorial, with Tim Buckwalter, at the Arabic Linguistics Symposium, April 3, 2005, UIUC.

A travelog from India.

A new article with Steve Farmer and Michael Witzel in the Electronic Journal of Vedic Studies, 11(2), 2004, argues that the so-called Indus Valley script was not a writing system at all. The December 17, 2004 issue of Science ran a feature on our work.
Evidently this article has caused a bit of a stir in some circles. See the related challenge (worth $10,000!!) to prove that I'm an idiot.