SABLE: A Synthesis Markup Language (version 1.0)

General Preliminaries

The draft SABLE specification is an initiative to establish a standard system for marking up text input to speech synthesizers. The current draft (version 1.0) is being circulated for comment by users, developers and researchers of speech synthesis.

See also a paper describing SABLE, and a position paper describing the relation between SABLE and the W3C's Aural Cascaded Style Sheets.

Finally, see the Linguistic Data Consortium's web page on Linguistic Annotation.

The name SABLE is tentative. At some time it may change to ??ML for ? ? Markup Language.

Why SABLE?

Currently, speech synthesizers are controlled by a multitude of proprietary tag sets. These tag sets vary substantially across synthesizers and are an inhibitor to the adoption of speech synthesis technology by developers.

This SABLE markup language is being developed with the following goals in mind:

Ancestry

The SABLE specification evolved as an initiative to combine three existing speech synthesis markup languages:


Mark Attribute

In addition to the specified attributes, every SABLE tag allows for a MARK attribute, as defined below:
MARK Character-string identifier for this tag.
This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location.
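
As a rough illustration of how an application might locate marks, the following Python sketch scans SABLE text and records each MARK value together with the number of plain-text characters that precede it. It assumes the markup is regular enough for Python's standard html.parser module to tokenize. The names MarkCollector and find_marks are hypothetical and are not part of SABLE.

```python
from html.parser import HTMLParser


class MarkCollector(HTMLParser):
    """Collect (mark, offset) pairs from SABLE text (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chars_seen = 0   # plain-text characters read so far
        self.marks = []       # (MARK identifier, offset into the plain text)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so MARK arrives as "mark".
        for name, value in attrs:
            if name == "mark" and value is not None:
                self.marks.append((value, self.chars_seen))

    def handle_data(self, data):
        self.chars_seen += len(data)


def find_marks(sable_text):
    """Return the marks found in sable_text, in document order."""
    collector = MarkCollector()
    collector.feed(sable_text)
    collector.close()
    return collector.marks


print(find_marks('Move the <MARKER MARK="mouse"> mouse to the top.'))
```

An engine would report a mark when synthesis reaches the corresponding position; the character offset used here is only a stand-in for that event.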

Sable Version 1.0 Tagset

The following tags are defined for Sable Version 1.0:

EMPH

Description: Set the emphasis of the contained text.

Attributes:

LEVEL Defines the level of emphasis. Either a floating point number or integer greater than or equal to 0.0; or a descriptive term, with the following numerical interpretations:
Strong 2.0
Moderate 1.0
None 0.5
Reduced 0.0
Default is Moderate.
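
The numerical interpretations above can be sketched as a small lookup. This is a minimal Python illustration, assuming that descriptive terms are matched case-insensitively (the specification does not say so); resolve_emph_level is a hypothetical helper name.

```python
# Numerical interpretations of the descriptive terms, as listed for EMPH LEVEL.
EMPH_LEVELS = {"strong": 2.0, "moderate": 1.0, "none": 0.5, "reduced": 0.0}


def resolve_emph_level(level=None):
    """Return the numeric emphasis level for an EMPH LEVEL attribute value."""
    if level is None:
        return EMPH_LEVELS["moderate"]  # Default is Moderate.
    key = level.strip().lower()  # assumption: terms are case-insensitive
    if key in EMPH_LEVELS:
        return EMPH_LEVELS[key]
    value = float(key)  # raises ValueError if the value is not a number
    if value < 0.0:
        raise ValueError("EMPH LEVEL must be greater than or equal to 0.0")
    return value
```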

Properties:

Example:

"The leaders of <EMPH>Denmark</EMPH> and <EMPH>India</EMPH> meet on Friday."


BREAK

Description: Sets an intrasentential, prosodic break at the current position. (Contrasts with DIV, which sets a text-structural break.)

Attributes:

LEVEL Defines the break level. Either a floating point number or integer; or a descriptive term, with the following numerical interpretations:
Large 3.0
Medium 2.0
Small 1.0
None 0.0
Default is Medium.
MSEC A floating point number or integer greater than or equal to zero defining the length, in milliseconds, of the pause associated with this break. Default is that appropriate for a break of the defined LEVEL.
TYPE A punctuation symbol that represents (roughly) the kind of intonation contour to be associated with the utterance preceding the BREAK: currently proposed values are "?" ("sounds like a question"); "!" ("sounds like an exclamation"); "." ("sounds like a statement"); "," ("sounds as if there is more coming").

Properties:

Example:

"Without style, <BREAK LEVEL="large"> Grace and I are in trouble."


PITCH

Description: Sets properties associated with pitch of the enclosed region.

Attributes:

BASE Sets the bottom, or "base" line of the intonation. A specification in one of the following formats:
  • A positive floating-point number representing an absolute Hz value
  • A percentage value higher or lower than the current. Thus, for N a floating point number, the following are legal specifications:
    N% N percent above current
    +N% N percent above current
    -N% N percent below current
  • A descriptive term:
highest highest possible value for engine/speaker/user
high reasonable high value for engine/speaker/user
medium reasonable medium value for engine/speaker/user
low reasonable low value for engine/speaker/user
lowest lowest possible value for engine/speaker/user
default reset to default value for engine/speaker/user
Default is 0% (no change).
MIDDLE Sets the middle, or "reference" line of the intonation. A specification in one of the following formats:
  • A positive floating-point number representing an absolute Hz value
  • A percentage value higher or lower than the current. Thus, for N a floating point number, the following are legal specifications:
    N% N percent above current
    +N% N percent above current
    -N% N percent below current
  • A descriptive term:
highest highest available value for engine/speaker/user
high reasonable high value for engine/speaker/user
medium reasonable medium value for engine/speaker/user
low reasonable low value for engine/speaker/user
lowest lowest available value for engine/speaker/user
default reset to default value for engine/speaker/user
Default is 0% (no change).
RANGE Sets the "pitch range" of the intonation. A specification in one of the following formats:
  • A positive floating-point number representing an absolute Hz value
  • A percentage value higher or lower than the current. Thus, for N a floating point number, the following are legal specifications:
    N% N percent above current
    +N% N percent above current
    -N% N percent below current
  • A descriptive term:
largest largest available value for engine/speaker/user
large reasonable large value for engine/speaker/user
medium reasonable medium value for engine/speaker/user
small reasonable small value for engine/speaker/user
smallest smallest available value for engine/speaker/user
default reset to default value for engine/speaker/user
Default is 0% (no change).
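
The three specification formats shared by BASE, MIDDLE and RANGE can be resolved roughly as follows. This Python sketch assumes the engine knows the value currently in effect and supplies its own table of descriptive terms; resolve_pitch, current_hz and named_hz are hypothetical names, and the Hz figures in the usage line are invented for illustration.

```python
def resolve_pitch(spec, current_hz, named_hz):
    """Resolve a BASE, MIDDLE or RANGE specification to a value in Hz.

    current_hz -- the value in effect before the tag (assumed known to the engine)
    named_hz   -- hypothetical engine table mapping descriptive terms to Hz
    """
    spec = spec.strip().lower()
    if spec in named_hz:
        return named_hz[spec]
    if spec.endswith("%"):
        # "N%" and "+N%" raise the current value; "-N%" lowers it.
        percent = float(spec[:-1])
        return current_hz * (100.0 + percent) / 100.0
    hz = float(spec)  # otherwise an absolute Hz value
    if hz <= 0.0:
        raise ValueError("an absolute Hz value must be positive")
    return hz


# With a current base of 100 Hz, BASE="-20%" gives 80 Hz.
print(resolve_pitch("-20%", 100.0, {"default": 110.0}))
```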

Properties:

Note that unlike some other markup schemes, there is no explicit tag for downstep or declination. These can, however, be implemented by appropriate resettings of BASE, MIDDLE and RANGE.

Example:

"Without his penguin, <PITCH BASE="-20%"> which he left at home, </PITCH> he could not enter the restaurant."


RATE

Description: Set the speech rate of the contained text.

Attributes:

SPEED Sets the speed. A specification in one of the following formats:
  • A positive floating-point number representing an absolute rate in words per minute
  • A percentage value higher or lower than the current. Thus, for N a floating point number, the following are legal specifications:
    N% N percent above current
    +N% N percent above current
    -N% N percent below current
  • A descriptive term:
fastest fastest available rate for engine/speaker/user
fast reasonable fast rate for engine/speaker/user
medium reasonable default rate for engine/speaker/user
slow reasonable slow rate for engine/speaker/user
slowest slowest available rate for engine/speaker/user
Default is +0% (no change).

Properties:

The term words per minute is to be understood rather loosely, and is probably language-dependent in its interpretation. For English speakers who are used to thinking in terms of orthographic words (i.e., words that are defined by surrounding whitespace in a text), the normal notion of words-per-minute (commonly used to define rate of speech or rate of typing) should apply. For Japanese, where non-linguists are not used to thinking in terms of words, another measure might be more appropriate, e.g. "bunsetsu-per-minute". The tag should really be interpreted as something like LANGUAGE-APPROPRIATE-MINIMAL-INDEPENDENT-UNIT-PER-MINUTE. For reasons that should be obvious this would, alas, not make a very good tag name.

Example:

"The address is <RATE SPEED="-20%"> 10 Main Street </RATE>."


VOLUME

Description: Set the volume of the contained text.

Attributes:

LEVEL Defines the amplitude level. A specification in one of the following formats:
  • A floating-point number between 0 (= silence) and 1 (maximum volume for the engine)
  • A percentage value higher or lower than the current. Thus, for N a floating point number, the following are legal specifications:
    N% N percent above current
    +N% N percent above current
    -N% N percent below current
  • A descriptive term:
loudest loudest available volume for engine
loud reasonable loud volume for engine
medium reasonable medium volume for engine
quiet quietest audible volume for engine

Default is medium.

Properties:

This tag sets only the volume. Associated phonation changes are not implied. Thus quiet is not a whisper.

Example:

"Please speak more <VOLUME LEVEL="loud">loudly</VOLUME>."


AUDIO

Description: Load and play an audio URL starting at the given point

Attributes:

SRC URL of a document with an appropriate mime-type
MODE Either of the following:
background play as background to speech from this point on
insertion play at this point, and when finished resume speaking
Default is insertion.
LEVEL A floating point number greater than or equal to 0.0. 1.0 is the same level as the original audio; 0.0 is silent. If not specified, the engine should scale the SRC's amplitude to be approximately that of the surrounding speech.

Properties:

AUDIO is not a required tag in a SABLE-conformant system: it is recognized that not all engines/systems may be able to support it. Furthermore, it is acceptable if a system supports some audio types (e.g. .au, .aiff), but not others (e.g. .wav, real audio).

Example:

"Beethoven <AUDIO SRC="5th.au"> and Tchaikovsky <AUDIO SRC="1812.wav"> wrote good music!"


ENGINE

Description: Substitute the DATA for the contained text if the system happens to be using the engine specified by ID.

Attributes:

ID Identifier for the specific TTS engine.
DATA Any character string to be substituted for the contained text.

Properties:

The ENGINE tag allows one to select a specific text to be substituted for the contained text for a given synthesizer, if one happens to be using that synthesizer to read the given SABLE document. It also serves as a way to pass engine-specific controls to a given engine: this can be implemented by using the ENGINE tag to enclose empty text, and having the DATA be the control string. Engines other than the one specified by ID are free to ignore this tag, or may attempt to interpret it if they think they are able to.

Example:

"The <ENGINE ID="acme synth" DATA="wonderful, fantastic acme synthesizer"> Acme synthesizer</ENGINE>."

On an Acme system it says "wonderful, fantastic acme synthesizer". On other systems, it says just "Acme synthesizer".
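
The substitution rule can be sketched in a few lines of Python. This sketch assumes the running engine knows its own identifier and that identifiers are compared exactly; it takes the "ignore" option for non-matching engines. The name engine_text and the "other synth" identifier are hypothetical.

```python
def engine_text(contained_text, engine_id, data, active_engine):
    """Return the text to speak for an ENGINE element.

    active_engine is the identifier of the engine currently running
    (an assumption: how an engine names itself is not defined here).
    """
    if engine_id == active_engine:
        return data            # the named engine substitutes DATA
    return contained_text      # other engines ignore the tag and read the contents


acme_data = "wonderful, fantastic acme synthesizer"
print(engine_text("Acme synthesizer", "acme synth", acme_data, "acme synth"))
print(engine_text("Acme synthesizer", "acme synth", acme_data, "other synth"))
```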


MARKER

Description: serves as an anchor point for a MARK that is not otherwise associated with another tag.

Attributes:

MARK Character-string identifier for this tag.

Properties:

MARK is an attribute of any SABLE tag. However, there may be instances where one wants to set a MARK, but where no specific tag is appropriate. MARKER should be used in such instances.

Example:

"Move the <MARKER MARK="mouse"> mouse to the top."


SABLE

Description: Identifies the current document as being a SABLE document.

No Attributes.

Properties:

Example:

<SABLE>
The text to be spoken goes here.
It might include special tags.
</SABLE>


PRON

Description: Substitute the given pronunciation for the pronunciation that would normally be computed for the contained text.

Attributes:

IPA Character string in Unicode IPA describing the pronunciation to be used for the contained text.
SUB Character string representing an attempt at "phonetic" spelling (in the language of the enclosing text) for the contained text.
ORIGIN Identifier for the language of origin of the enclosed text, following the ISO 639 scheme.

Properties:

The IPA attribute is provided to allow for a precise phonetic rendering for the contained text. Recognizing that many developers may not be experienced with IPA, or other formal phonetic transcription schemes, SABLE provides an alternative method for specifying the desired pronunciation, using the SUB attribute. Using this attribute, an application may substitute for the contained text an attempt at phonetically spelling the text. Thus, one might want to specify the British pronunciation of "tomato" as follows:

<PRON SUB="tomahto">tomato</PRON>

Needless to say, it will depend upon the engine being used whether this will actually result in an appropriate pronunciation. This is unavoidable, and developers who desire both precision and portability should make the effort to learn IPA.

For languages that have a conventional, or semi-conventional "phonetic" writing scheme, in addition to, or as part of their normal orthography, the SUB attribute is an appropriate way to include intended pronunciations transcribed in that scheme. For example, if one wants to specify the exact pronunciation of a Japanese personal name that is normally written in kanji (Chinese characters), one could specify its pronunciation with the SUB attribute of PRON, using a transcription in kana. Similarly for Korean, if one is using older-style mixed Korean text with Chinese characters, one might specify the pronunciation in hankul using the SUB attribute. For Mandarin Chinese, one might use either pinyin or zhuyin fuhao (Mandarin phonetic symbol set) in that field, though which of these is supported is likely to be engine-specific.

ORIGIN may be used to specify that the enclosed text comes from a particular language, and may be pronounced accordingly by the engine:

This is all rather <PRON ORIGIN="fr">passe</PRON>

If both IPA and SUB are specified, IPA takes precedence. PRON instances that specify no attribute are ignored.
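
The precedence rule can be written out as a short Python sketch. The ordering of IPA before SUB and the ignoring of attribute-less PRON instances come from the text above; placing ORIGIN last is an assumption made for this sketch, since the text does not say how ORIGIN combines with the other two. choose_pronunciation is a hypothetical name.

```python
def choose_pronunciation(ipa=None, sub=None, origin=None):
    """Return (kind, value) for the attribute a PRON element should use,
    or None if the element specifies no attribute and is ignored."""
    if ipa is not None:
        return ("ipa", ipa)        # IPA takes precedence over SUB
    if sub is not None:
        return ("sub", sub)
    if origin is not None:
        return ("origin", origin)  # assumption: consulted only when IPA and SUB are absent
    return None                    # PRON with no attribute is ignored


print(choose_pronunciation(sub="tomahto"))
```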


SAYAS

Description: Defines a way in which the contained region is to be said.

Attributes:

MODE Mode in which to say the contained text. Currently supported values are:
literal Literally read the string of characters in contained text, as appropriate to the language. E.g. "spelling out" in English, or character-by-character descriptions in Chinese.
date Contained region is to be read as a date.
time Contained region is to be read as a time.
phone Contained region is to be read as a phone number.
net Contained region is an internet address or handle (URL or e-mail address)
postal Contained region is a postal address.
currency Contained region is a currency amount.
math Contained region is a mathematical expression.
fraction Contained region is a fraction.
measure Contained region is a measurement (e.g. 1km).
ordinal Contained region is an ordinal number.
cardinal Contained region is a cardinal number.
name Contained region is a proper name.
MODETYPE Secondary specification further qualifying MODE. The following values are defined for the given MODE value:
DMY date is in Day-Month-Year format.
MDY date is in Month-Day-Year format.
YMD date is in Year-Month-Day format.
YM date is in Year-Month format.
MY date is in Month-Year format.
MD date is in Month-Day format.
HM time is in Hour-Minute format.
HMS time is in Hour-Minute-Second format.
EMAIL net is an e-mail address.
URL net is a URL.

Properties:

SAYAS instances with no specified MODE are ignored. A date without a specified MODETYPE is interpreted as best the engine can, possibly taking account of the locale (e.g. MDY in USA, DMY in most other countries).

Example:

At <SAYAS MODE="time">2pm</SAYAS> on <SAYAS MODE="date" MODETYPE="YM"> 98/3</SAYAS> Mike will send <SAYAS MODE="currency">$4000</SAYAS> to <SAYAS MODE="net" MODETYPE="email">me@acme.com</SAYAS>.
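
The pairing of MODE and MODETYPE values can be checked with a small table. The following Python sketch encodes the values listed above; treating values case-insensitively, treating an unknown MODE like a missing one, and dropping a MODETYPE that is not defined for the given MODE are all assumptions of this sketch. normalize_sayas is a hypothetical name.

```python
SAYAS_MODES = {
    "literal", "date", "time", "phone", "net", "postal", "currency",
    "math", "fraction", "measure", "ordinal", "cardinal", "name",
}
# MODETYPE values defined for each MODE that has any.
SAYAS_MODETYPES = {
    "date": {"DMY", "MDY", "YMD", "YM", "MY", "MD"},
    "time": {"HM", "HMS"},
    "net": {"EMAIL", "URL"},
}


def normalize_sayas(mode=None, modetype=None):
    """Return (mode, modetype) for a SAYAS element, or None if it is ignored."""
    if mode is None:
        return None  # SAYAS instances with no specified MODE are ignored
    mode = mode.lower()
    if mode not in SAYAS_MODES:
        return None  # assumption: an unknown MODE is treated like a missing one
    if modetype is not None:
        modetype = modetype.upper()
        if modetype not in SAYAS_MODETYPES.get(mode, set()):
            modetype = None  # assumption: drop a MODETYPE not defined for this MODE
    return (mode, modetype)


print(normalize_sayas("net", "email"))
```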


LANGUAGE

Description: Specify the language of the contained text.

Attributes:

ID Identifier for the desired language, following the ISO 639 scheme, or for a dialect, following RFC 1766.
CODE Optional id for the encoding scheme used for the language.

Properties:

LANGUAGE can tag a region of any size. However, for most applications, and for most TTS systems, it will not be desirable to switch languages within a sentence.

Unless the SPEAKER is also specified, changing to a new language will result in using the default speaker for that language.

LANGUAGE instances without an associated ID specification will be ignored.

If the CODE attribute is not specified then it should default to the engine-specific default for the given language. In most cases it should not be necessary to specify it. The main instances in which it would be used are in cases where there is more than one viable option for encoding the script for the language. For example Chinese could be coded in either GB or Big5.

Example:

<LANGUAGE ID="en">Some text in English.</LANGUAGE>
<LANGUAGE ID="de">Ein deutscher Satz.</LANGUAGE>


SPEAKER

Description: Specify the speaker to use for the contained text.

Attributes:

GENDER Gender for the desired speaker. Values are:
male Male speaker
female Female speaker

Default is the default gender for the engine.

AGE Description of the age of the desired speaker. Values are:
older Older speaker
middle Middle aged speaker
younger Young adult
teen Teenager
child Child

Default is the default "age" for the engine.

NAME Name of a speaker if a particular engine is being used.

Properties:

If NAME is specified, then it may override the other specifications of AGE and GENDER. If the system does not have a speaker with the given NAME, then the GENDER and AGE specifications (or their defaults) are used.
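
The fallback from NAME to GENDER and AGE might be implemented along these lines. This Python sketch assumes a voice inventory represented as a list of dictionaries and engine defaults of "female" and "middle"; the inventory, the defaults and the name select_speaker are all invented for illustration.

```python
def select_speaker(voices, name=None, gender=None, age=None,
                   default_gender="female", default_age="middle"):
    """Pick a voice from a list of dicts with 'name', 'gender' and 'age' keys."""
    if name is not None:
        for voice in voices:
            if voice["name"] == name:
                return voice  # NAME found: in this sketch it overrides AGE and GENDER
    # No such NAME (or none given): fall back to GENDER and AGE, or their defaults.
    wanted_gender = gender or default_gender
    wanted_age = age or default_age
    for voice in voices:
        if voice["gender"] == wanted_gender and voice["age"] == wanted_age:
            return voice
    return None  # no exact match; a real engine would likely pick the closest voice


# Hypothetical voice inventory.
VOICES = [
    {"name": "anna", "gender": "female", "age": "middle"},
    {"name": "tim", "gender": "male", "age": "child"},
]
print(select_speaker(VOICES, gender="male", age="child"))
```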

Example:

<SPEAKER GENDER="male" AGE="child">I'm a young boy!</SPEAKER>


DIV

Description: Classifies the contained region as a division of type TYPE.

Attributes:

TYPE Type of the division. Currently allowed values are:
sentence Sentence
paragraph Paragraph

Properties:

Currently recommended types are only SENTENCE and PARAGRAPH. However, it is intended that DIV be used to support any reasonable division within a text. For instance, in the relatively unlikely event of a TTS system reading poetry, <DIV TYPE=line> and <DIV TYPE=stanza> might be reasonable. Similarly, in a transcription of a dialogue, DIV tags marking turn-taking might also be desirable. Indeed, such extensions to the values for TYPE are legal SABLE, though one cannot of course expect portability unless the TYPE is explicitly defined for the standard.

DIV instances with no specified TYPE are ignored.

Example:

<DIV TYPE="paragraph">
<DIV TYPE="sentence" >
Yesterday, Denmark and India announced an agreement of cultural exchange. </DIV>
<DIV TYPE="sentence">
Further talks will take place next month.
</DIV>
</DIV>


Non-Standard Extensions to SABLE

SABLE is designed to function as a well-defined standard in which the same text will be handled consistently by multiple synthesizers. SABLE is also intended to function as a tool for research on speech synthesis and as a tool for innovation. As such, it is expected that research systems will support tags, attributes and attribute values not defined in the SABLE specification, and that SABLE text will be generated for specific systems which includes those tags and attributes. Where such extensions prove useful and become generally supported they can be proposed as an addition to the standard specification.

To clearly distinguish tags, attributes and attribute values that are non-standard they should include an "X-" prefix and optionally an engine identifier.

Non-Standard Tag

A non-standard tag for providing an engine-specific pronunciation string could look like:

<X-ME-PRON PHON="i" DUR="120">

where ME is "My Engine" and the X-ME-PRON element inserts an "i" phoneme with a duration of 120msec understood by "My Engine". (Because the PHON and DUR attributes are embedded in a non-standard element, they are implicitly non-standard attributes.)

Non-Standard Attribute

A non-standard attribute of a standard tag might look like:

<PRON X-ME-PHONES="ka:t">cat</PRON>

or

<EMPH LEVEL="strong" X-PITCHACCENT="H*+L">word</EMPH>

The first example provides the pronunciation for "cat" in a format that is understood by "My Engine". Other synthesizers will ignore the attribute.

The second example includes both a standard attribute -- LEVEL -- and a non-standard attribute -- X-PITCHACCENT. A system that understands the non-standard attribute will apply the "H*+L" accent when producing strong emphasis on "word".

Non-Standard Attribute Value

A non-standard attribute value might look like:

<DIV TYPE="x-dialog-close">...</DIV>

The "x-dialog-close" is a non-standard value of the standard TYPE attribute which is currently specified as being either "sentence" or "paragraph". This non-standard value could indicate that the contents of the element are the end of a dialog turn.

Notes

If an engine gets a non-standard tag, attribute or attribute value in its input text that it does not know, it simply ignores it. For example, in the X-ME-PHONES example, a synthesizer that ignores the attribute will try to say the word "cat" using its own pronunciation.

Wherever possible, non-standard tags and attributes should be designed so that output is not substantially impacted if ignored.
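
The ignore rule for unknown extensions can be sketched as a filter over an element's attributes. This Python sketch assumes attributes have already been parsed into a dictionary and that the "X-" prefix is matched case-insensitively; is_nonstandard and supported_attributes are hypothetical names.

```python
def is_nonstandard(name):
    """True if a tag, attribute or attribute value carries the "X-" prefix."""
    return name.upper().startswith("X-")


def supported_attributes(attrs, understood):
    """Keep standard attributes plus any extensions this engine understands.

    understood -- set of upper-case non-standard attribute names the engine knows
    """
    kept = {}
    for name, value in attrs.items():
        if not is_nonstandard(name) or name.upper() in understood:
            kept[name] = value
    return kept


attrs = {"LEVEL": "strong", "X-PITCHACCENT": "H*+L"}
print(supported_attributes(attrs, set()))              # extension ignored
print(supported_attributes(attrs, {"X-PITCHACCENT"}))  # extension kept
```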


CHANGES


From Version 0.1

Unresolved from V0.1