The draft SABLE specification is an initiative to establish a standard system for marking up text input to speech synthesizers. The current draft (version 1.0) is being circulated for comment by users, developers and researchers of speech synthesis.
See also a paper describing SABLE, and a position paper describing the relation between SABLE and the W3C's Aural Cascaded Style Sheets.
Finally, see the Linguistic Data Consortium's web page on Linguistic Annotation.
The name SABLE is tentative. At some time it may change to ??ML for ? ? Markup Language.
Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers.
This SABLE markup language is being developed with the following goals in mind:
The SABLE specification evolved as an initiative to combine three existing speech synthesis markup languages:
MARK | Character-string identifier for this tag. |
Attributes:
LEVEL | Defines the level of emphasis. Either a floating point
number or integer greater than or equal to 0.0; or a descriptive term,
with the following numerical interpretations:
|
Properties:
Example:
"The leaders of <EMPH>Denmark</EMPH> and <EMPH>India</EMPH> meet on Friday."
Attributes:
LEVEL | Defines the break level. Either a floating point
number or integer; or a descriptive term,
with the following numerical interpretations:
|
MSEC | A floating point number or integer greater than or equal to zero defining the length, in milliseconds, of the pause associated with this break. The default is the length appropriate for a break of the defined LEVEL. |
TYPE | A punctuation symbol that represents (roughly) the kind of intonation contour to be associated with the utterance preceding the BREAK: currently proposed values are "?" ("sounds like a question"); "!" ("sounds like an exclamation"); "." ("sounds like a statement"); "," ("sounds as if there is more coming"). |
Properties:
Example:
"Without style, <BREAK LEVEL="large"> Grace and I are in trouble."
Attributes:
BASE | Sets the bottom, or "base" line of the intonation.
A specification in one of the following formats:
|
MIDDLE | Sets the middle, or "reference" line of the intonation.
A specification in one of the following formats:
|
RANGE | Sets the "pitch range" of the intonation.
A specification in one of the following formats:
|
Properties:
Note that unlike some other markup schemes, there is no explicit tag for downstep or declination. These can, however, be implemented by appropriate resettings of BASE, MIDDLE and RANGE.
Example:
"Without his penguin, <PITCH BASE="-20%"> which he left at home, </PITCH> he could not enter the restaurant."
Attributes:
SPEED | Sets the speed.
A specification in one of the following formats:
|
Properties:
The term words per minute is to be understood rather loosely, and is probably language-dependent in its interpretation. For English speakers who are used to thinking in terms of orthographic words (i.e., words that are defined by surrounding whitespace in a text), the normal notion of words-per-minute (commonly used to define rate of speech or rate of typing) should apply. For Japanese, where non-linguists are not used to thinking in terms of words, another measure might be more appropriate: e.g. "bunsetsu-per-minute". The tag should really be interpreted as something like LANGUAGE-APPROPRIATE-MINIMAL-INDEPENDENT-UNIT-PER-MINUTE. For reasons that should be obvious this would, alas, not make a very good tag name.
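A relative specification such as SPEED="-20%" can be resolved against the engine's current rate. The following is a minimal sketch only: the function name is hypothetical, and it assumes that a signed percentage means a change relative to the current value, expressed in the engine's own language-appropriate unit per minute.

```python
def apply_percent(current, spec):
    """Resolve a signed percentage such as "-20%" against a current value.

    Assumption: the percentage is relative to `current`; this reading is
    inferred from the examples, not stated by the specification.
    """
    spec = spec.strip()
    if not spec.endswith("%"):
        raise ValueError("only percentage specifications are handled here")
    change = float(spec[:-1])  # "-20%" -> -20.0, "+10%" -> 10.0
    return current * (1.0 + change / 100.0)


if __name__ == "__main__":
    # A hypothetical engine default of 180 units per minute, slowed by 20%.
    print(apply_percent(180.0, "-20%"))
```

The same arithmetic could serve for the percentage form of the PITCH attributes, under the same assumption.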
Example:
"The address is <RATE SPEED="-20%"> 10 Main Street </RATE>."
Attributes:
LEVEL | Defines the amplitude level.
A specification in one of the following formats:
Default is medium.
|
Properties:
This tag sets only the volume. Associated phonation changes are not implied. Thus quiet is not a whisper.
Example:
"Please speak more <VOLUME LEVEL="loud">loudly</VOLUME>."
Attributes:
SRC | URL of a document with an appropriate MIME type. |
MODE | Either of the following:
|
LEVEL | A floating point number greater than or equal to 0.0. 1.0 is the same level as the original audio; 0.0 is silent. If not specified, the engine should scale the SRC's amplitude to be approximately that of the surrounding speech. |
Properties:
AUDIO is not a required tag in a SABLE-conformant system: it is recognized that not all engines/systems may be able to support it. Furthermore, it is acceptable if a system supports some audio types (e.g. .au, .aiff), but not others (e.g. .wav, real audio).
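For engines that do support AUDIO, the LEVEL attribute amounts to a gain applied to the source samples. The sketch below is illustrative only: the function names are hypothetical, and matching root-mean-square amplitude is merely one assumed reading of "approximately that of the surrounding speech".

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def scale_audio(samples, level=None, speech_rms=None):
    """Apply the AUDIO LEVEL gain to a list of samples.

    With an explicit LEVEL, 1.0 keeps the original level and 0.0 is silent.
    Without one, the gain is chosen so the clip's RMS matches `speech_rms`
    (an assumed measure of the surrounding speech level).
    """
    if level is not None:
        if level < 0.0:
            raise ValueError("LEVEL must not be negative")
        gain = level
    else:
        source_rms = rms(samples)
        if speech_rms is None or source_rms == 0.0:
            gain = 1.0  # nothing to match against; leave the clip unchanged
        else:
            gain = speech_rms / source_rms
    return [s * gain for s in samples]
```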
Example:
"Beethoven <AUDIO SRC="5th.au"> and Tchaikovsky <AUDIO SRC="1812.wav"> wrote good music!"
Attributes:
ID | Identifier for the specific TTS engine. |
DATA | Any character string to be substituted for the contained text. |
Properties:
The ENGINE tag allows one to select a specific text to be substituted for the contained text for a given synthesizer, if one happens to be using that synthesizer to read the given SABLE document. It also serves as a way to pass engine-specific controls to a given engine: this can be implemented by using the ENGINE tag to enclose empty text, and having the DATA be the control string. Engines other than the one specified by ID are free to ignore this tag, or may attempt to interpret it if they think they are able to.
Example:
"The <ENGINE ID="acme synth" DATA="wonderful, fantastic acme synthesizer"> Acme synthesizer</ENGINE>."
On an Acme system it says "wonderful, fantastic acme synthesizer". On other systems, it says just "Acme synthesizer".
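The substitution rule can be sketched as follows. This is an illustration rather than a prescribed implementation: the function name is hypothetical, and comparing engine identifiers case-insensitively is an assumption, since the specification does not say how IDs are matched.

```python
def resolve_engine_text(current_engine_id, tag_id, data, contained_text):
    """Return the text one ENGINE element contributes for a given engine.

    Assumption: engine IDs are compared case-insensitively after trimming.
    """
    same_engine = current_engine_id.strip().lower() == tag_id.strip().lower()
    if same_engine and data is not None:
        # The named engine substitutes DATA for the contained text.
        return data
    # Any other engine ignores the tag and reads the contained text.
    return contained_text


if __name__ == "__main__":
    for engine in ("acme synth", "other synth"):
        print(engine, "->", resolve_engine_text(
            engine, "acme synth",
            "wonderful, fantastic acme synthesizer", "Acme synthesizer"))
```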
Attributes:
MARK | Character-string identifier for this tag. |
Properties:
MARK is an attribute of any SABLE tag. However, there may be instances where one wants to set a MARK, but where no specific tag is appropriate. MARKER should be used in such instances.
Example:
"Move the <MARKER MARK="mouse"> mouse to the top."
No Attributes.
Properties:
Example:
<SABLE>
The text to be spoken goes here.
It might include special tags.
</SABLE>
Attributes:
IPA | Character string in Unicode IPA describing the pronunciation to be used for the contained text. |
SUB | Character string representing an attempt at "phonetic" spelling (in the language of the enclosing text) for the contained text. |
ORIGIN | Identifier for the language of origin of the enclosed text, following the iso639 scheme. |
Properties:
The IPA attribute is provided to allow for a precise phonetic rendering of the contained text. Recognizing that many developers may not be experienced with IPA or other formal phonetic transcription schemes, SABLE provides an alternative method for specifying the desired pronunciation: the SUB attribute. Using this attribute, an application may substitute for the contained text an attempt at phonetically spelling the text. Thus, one might want to specify the British pronunciation of "tomato" as follows:
<PRON SUB="tomahto">tomato</PRON>
Needless to say, it will depend upon the engine being used whether this will actually result in an appropriate pronunciation. This is unavoidable, and developers who desire both precision and portability should make the effort to learn IPA.
For languages that have a conventional, or semi-conventional "phonetic" writing scheme, in addition to, or part of their normal orthography, the SUB attribute is an appropriate way to include intended pronunciations transcribed in that scheme. For example, if one wants to specify the exact pronunciation of a Japanese personal name that is normally written in kanji (Chinese characters), one could specify its pronunciation with the SUB attribute of PRON, using a transcription in kana. Similarly for Korean, if one is using older-style mixed Korean text with Chinese characters, one might specify the pronunciation using the SUB attribute using hankul. For Mandarin Chinese, one might use either pinyin or zhuyin fuhao (Mandarin phonetic symbol set) in that field, though it is likely to be engine specific which one of these will be supported.
ORIGIN may be used to specify that the enclosed text comes from a particular language, and may be pronounced accordingly by the engine:
This is all rather <PRON ORIGIN="fr">passe</PRON>
If both IPA and SUB are specified, IPA takes precedence. PRON instances that specify no attribute are ignored.
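The precedence rule might be sketched as below. The function name and return shape are hypothetical, and consulting ORIGIN only after IPA and SUB is an assumption: the specification orders only IPA and SUB.

```python
def choose_pronunciation(text, ipa=None, sub=None, origin=None):
    """Pick the pronunciation source for one PRON element.

    Returns a (kind, value) pair. IPA wins over SUB, as the specification
    requires; placing ORIGIN last is an assumption of this sketch.
    """
    if ipa:
        return ("ipa", ipa)
    if sub:
        return ("sub", sub)
    if origin:
        return ("origin", origin)
    # No attribute given: the PRON tag is ignored and the text is read as is.
    return ("text", text)


if __name__ == "__main__":
    print(choose_pronunciation("tomato", sub="tomahto"))
```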
Attributes:
MODE | Mode in which to say the contained text. Currently
supported values are:
|
MODETYPE | Secondary specification further qualifying
MODE. The following values are defined for the given MODE value:
|
Properties:
SAYAS instances with no specified MODE are ignored. A date without a specified MODETYPE is interpreted as best the engine can, possibly taking account of the locale (e.g. MDY in the USA, DMY in most other countries).
Example:
At <SAYAS MODE="time">2pm</SAYAS> on <SAYAS MODE="date" MODETYPE="YM"> 98/3</SAYAS> Mike will send <SAYAS MODE="currency">$4000</SAYAS> to <SAYAS MODE="net" MODETYPE="email">me@acme.com</SAYAS>.
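How a MODETYPE such as "YM" might drive the reading of a date can be sketched as follows. The function name, the set of separator characters, and the two-way locale fallback are all assumptions made for illustration; a real engine would need a fuller locale table.

```python
import re

# Letters used in date MODETYPE values, mapped to field names.
FIELD_NAMES = {"D": "day", "M": "month", "Y": "year"}


def split_date(text, modetype=None, locale="US"):
    """Split a date string into named fields.

    If MODETYPE is absent, fall back on a locale guess: MDY for "US",
    DMY otherwise (a deliberately crude assumption).
    """
    order = (modetype or ("MDY" if locale == "US" else "DMY")).upper()
    # Assumed separators: slash, dot, hyphen, or whitespace.
    parts = [p for p in re.split(r"[/.\-\s]+", text.strip()) if p]
    if len(parts) != len(order):
        raise ValueError("date %r does not match field order %r" % (text, order))
    return {FIELD_NAMES[letter]: part for letter, part in zip(order, parts)}


if __name__ == "__main__":
    print(split_date(" 98/3", "YM"))
```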
Attributes:
ID | Identifier for the desired language, following the ISO 639 scheme, or for the desired dialect, following RFC 1766. |
CODE | Optional id for the encoding scheme used for the language. |
Properties:
LANGUAGE can tag a region of any size. However, for most applications, and for most TTS systems, it will not be desirable to switch languages within a sentence.
Unless the SPEAKER is also specified, changing to a new language will result in using the default speaker for that language.
LANGUAGE instances without an associated ID specification will be ignored.
If the CODE attribute is not specified then it should default to the engine-specific default for the given language. In most cases it should not be necessary to specify it. The main instances in which it would be used are in cases where there is more than one viable option for encoding the script for the language. For example Chinese could be coded in either GB or Big5.
Example:
<LANGUAGE ID="en">Some text in English.</LANGUAGE>
<LANGUAGE ID="de">Ein deutscher Satz.</LANGUAGE>
Attributes:
GENDER | Gender for the desired speaker.
Values are:
Default is the default gender for the engine. |
AGE | Description of the age of the desired speaker. Values
are:
Default is the default "age" for the engine. |
NAME | Name of a speaker if a particular engine is being used. |
Properties:
If NAME is specified, it may override the AGE and GENDER specifications. If the system does not have a speaker with the given NAME, then the GENDER and AGE specifications (or their defaults) are used.
Example:
<SPEAKER GENDER="male" AGE="child">I'm a young boy!</SPEAKER>
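The fallback from NAME to GENDER and AGE could be sketched as follows. The voice records, the speaker names, the "adult" age label, and the engine defaults are hypothetical; only "male" and "child" are taken from the example above.

```python
# Hypothetical voice inventory for one engine.
VOICES = [
    {"name": "kim", "gender": "female", "age": "adult"},
    {"name": "tom", "gender": "male", "age": "child"},
]


def select_voice(voices, name=None, gender=None, age=None,
                 default_gender="female", default_age="adult"):
    """Choose a voice for one SPEAKER element.

    A matching NAME wins outright. Otherwise GENDER and AGE, or the
    engine defaults for whichever is missing, are matched exactly.
    Returns None if nothing fits.
    """
    if name is not None:
        for voice in voices:
            if voice["name"] == name:
                return voice
    wanted_gender = gender or default_gender
    wanted_age = age or default_age
    for voice in voices:
        if voice["gender"] == wanted_gender and voice["age"] == wanted_age:
            return voice
    return None


if __name__ == "__main__":
    print(select_voice(VOICES, gender="male", age="child"))
```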
Attributes:
TYPE | Type of the division. Currently allowed values are:
|
Properties:
Currently recommended types are only SENTENCE and PARAGRAPH. However, it is intended that DIV be used to support any reasonable division within a text. For instance, in the relatively unlikely event of a TTS system reading poetry, <DIV TYPE=line> and <DIV TYPE=stanza> might be reasonable. Similarly, in a transcription of a dialogue, DIV tags marking turn-taking might also be desirable. Indeed, such extensions to the values for TYPE are legal SABLE, though one cannot of course expect portability unless the TYPE is explicitly defined for the standard.
DIV instances with no specified TYPE are ignored.
Example:
<DIV TYPE="paragraph">
<DIV TYPE="sentence" >
Yesterday, Denmark and
India announced an agreement of cultural exchange.
</DIV>
<DIV TYPE="sentence">
Further talks will take place next month.
</DIV>
</DIV>
To clearly distinguish non-standard tags, attributes and attribute values, each should include an "X-" prefix and, optionally, an engine identifier.
<X-ME-PRON PHON="i" DUR="120">
where ME is "My Engine" and the X-ME-PRON element inserts an "i" phoneme with a duration of 120msec understood by "My Engine". (Because the PHON and DUR attributes are embedded in a non-standard element, they are implicitly non-standard attributes.)
<PRON X-ME-PHONES="ka:t">cat</PRON>
or
<EMPH LEVEL="strong" X-PITCHACCENT="H*+L">word</EMPH>
The first example provides the pronunciation for "cat" in a format that is understood by "My Engine". Other synthesizers will ignore the attribute.
The second example includes both a standard attribute -- LEVEL -- and a non-standard attribute -- X-PITCHACCENT. A system that understands the non-standard attribute will apply the "H*+L" accent when producing strong emphasis on "word".
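An engine's handling of non-standard attributes it does not understand might be sketched as below. The function name and the dictionary representation of parsed attributes are assumptions made for illustration.

```python
def drop_unknown_extensions(attrs, understood=frozenset()):
    """Remove "X-" attributes that this engine does not understand.

    `attrs` maps attribute names to values for one parsed tag;
    `understood` holds the upper-case names of the extensions the
    engine supports. Standard attributes always pass through.
    """
    kept = {}
    for name, value in attrs.items():
        is_extension = name.upper().startswith("X-")
        if not is_extension or name.upper() in understood:
            kept[name] = value
    return kept


if __name__ == "__main__":
    emph = {"LEVEL": "strong", "X-PITCHACCENT": "H*+L"}
    print(drop_unknown_extensions(emph))
    print(drop_unknown_extensions(emph, understood={"X-PITCHACCENT"}))
```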
Non-Standard Attribute Value
A non-standard attribute value might look like:
<DIV TYPE="x-dialog-close">...</DIV>
The "x-dialog-close" is a non-standard value of the standard TYPE
attribute, which is currently specified as being either "sentence"
or "paragraph". This non-standard value could indicate that the
contents of the element are the end of a dialog turn.
Notes
Wherever possible, non-standard tags and elements should be designed so
that output is not substantially impacted if ignored.
If an engine gets a non-standard tag, attribute or attribute value in
its input text that it does not know, it simply ignores it. For
example, in the X-ME-PHONES example, a synthesizer that ignores the
attribute will try to say the word "cat".
CHANGES
From Version 0.1
Unresolved from V0.1