There is an ever-increasing demand for speech synthesis (TTS) technology in various applications including e-mail reading, information access over the web, tutorial and language-teaching applications, and in assistive technology for users with various handicaps. Invariably, an application that was developed with a particular TTS system A cannot be ported, without a fair amount of additional work, to a new TTS system B, for the simple reason that the tag set used to control system A is completely different from the one used to control system B. The large variety of tag sets used by TTS systems is thus a problem for the expanded use of this technology, since developers are often unwilling to expend effort porting their applications to a new TTS system, even if the new system in question is of demonstrably higher quality than the one they are currently using.
SABLE is an XML (Extensible Markup Language)/SGML (Standard Generalized Markup Language)-based [2,1] markup scheme for text-to-speech synthesis, developed to address the need for a common TTS control paradigm. SABLE is based in part on two previous proposals by a subset of the present authors: the Spoken Text Markup Language (STML [4]; see also [6] for an even earlier proposal, SSML) and the Java Speech Markup Language (JSML [5]).
The SABLE markup language is being developed with the following goals in mind:
In both its generality and its coverage, SABLE has many advantages over existing markups such as Microsoft's SAPI [3] or Apple's Speech Manager control set. First, whereas the syntax of other schemes is typically ad hoc, SABLE's is based on XML/SGML, a widely used standard. Second, SAPI and other markup schemes provide tags only for speaker directives, not for text description. Text-description information, for example that a particular boundary in a text corresponds to the end of a line in a table (e.g., <DIV TYPE="x-tl">), can in principle be used by a TTS system to produce reasonable speech output that audibly marks the presence of that boundary. One does not necessarily want to have to instruct the synthesizer to use a particular intonation pattern, or to implement the break in a particular fashion: one might prefer simply to mark the presence of the boundary in an abstract way, and assume that the system will do something reasonable with that information. Text-description tags are explicitly designed to allow that kind of abstract specification.
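The distinction between text description and speaker directives can be illustrated with a small sketch. It is not part of any SABLE implementation. Only the <DIV TYPE="x-tl"> tag is taken from the text above; the <SABLE> wrapper element, the sample table lines, the engine names, and the rendering labels are all hypothetical and chosen purely for illustration. The point is that the markup states only that a table line ends, while each engine applies its own policy for how to voice that boundary.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <DIV TYPE="x-tl"> comes from the text; the
# <SABLE> wrapper and the sample table lines are illustrative assumptions.
FRAGMENT = """<SABLE>
<DIV TYPE="x-tl">Monday 9 am staff meeting</DIV>
<DIV TYPE="x-tl">Tuesday 2 pm design review</DIV>
</SABLE>"""

# Hypothetical engine-side policies: each synthesizer decides for itself how
# a table-line boundary should sound. The markup never names a pause length
# or an intonation contour.
ENGINE_POLICIES = {
    "engine_a": {"x-tl": "medium pause"},
    "engine_b": {"x-tl": "falling contour, short pause"},
}


def render_plan(markup, engine):
    """Return (text, boundary_rendering) pairs for each DIV in the markup."""
    policy = ENGINE_POLICIES[engine]
    plan = []
    for div in ET.fromstring(markup).iter("DIV"):
        div_type = div.get("TYPE")
        # Division types the engine does not know fall back to its default.
        rendering = policy.get(div_type, "engine default")
        plan.append(((div.text or "").strip(), rendering))
    return plan


if __name__ == "__main__":
    # The same document is rendered differently by each engine.
    for engine in ENGINE_POLICIES:
        print(engine, render_plan(FRAGMENT, engine))
```

The same marked-up text is passed unchanged to both hypothetical engines. Porting the application from one engine to the other requires no change to the document, which is the kind of abstraction that text-description tags are meant to provide.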