Note
For the latest supported speech solutions, see the Azure Cognitive Services for Speech documentation.
Speech Synthesis Markup Language Reference (Microsoft.Speech)
Speech Synthesis Markup Language (SSML) is an XML-based markup language that application developers use to control characteristics of synthetic speech (text-to-speech, or TTS) output, including voice, pitch, rate, volume, and pronunciation.
The Microsoft implementation of SSML is based on the World Wide Web Consortium (W3C) Speech Synthesis Markup Language (SSML) Version 1.0.
All SSML elements belong to the ssml namespace. The following elements are implemented in the Microsoft Speech Platform SDK 11.
SSML Element | Description | Usage | Attributes |
---|---|---|---|
audio | Supports the insertion of recorded audio files. | Optional | src |
break | An empty element used to control the prosodic boundaries between words. | Optional | strength, time |
emphasis | Increases the level of stress with which the contained text is spoken. | Optional | level |
lexicon | Specifies a lexicon document that contains the pronunciations for the content of the document. | Optional | uri, type |
mark | Designates a specific reference point in the text sequence. This element can also be used to mark an output audio stream for asynchronous notification. | Optional | name |
p and s | Denote the paragraph and sentence structure of the document. | Optional | xml:lang |
phoneme | Indicates the phonetic pronunciation for the contained text. Overrides the pronunciations in the lexicon, if one is specified. | Optional | ph, alphabet |
prosody | Controls the pitch, rate, and volume of the speech output. | Optional | pitch, contour, range, rate, duration, volume |
say-as | Indicates the type of text contained in the element (such as acronym, number, and date). | Optional | interpret-as, format, detail |
speak | The required root element for all SSML documents. | Required | version, xmlns, xml:lang |
sub | Specifies a string of text that should be pronounced in place of the text contained in the element. | Optional | alias |
voice | Specifies a voice and its attributes, to be used for synthesized speech, often used to change from one voice to another. | Optional | xml:lang, gender, age, variant, name |
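
To illustrate how several of the elements in the table combine into one document, the following is a minimal sketch that builds an SSML string with Python's standard `xml.etree.ElementTree` module. It is not taken from the Microsoft Speech Platform SDK: the helper names `q` and `build_ssml`, the sample text, and the attribute values (`gender="female"`, `interpret-as="date"` with `format="mdy"`, `rate="slow"`, `volume="loud"`) are assumptions chosen for illustration, and whether a given speech engine honors each value depends on the engine and the installed voices. The namespace URI is the one defined by the W3C SSML Version 1.0 recommendation.

```python
import xml.etree.ElementTree as ET

# Namespace defined by W3C SSML 1.0, and the built-in XML namespace for xml:lang.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def q(tag):
    """Return the namespace-qualified name of an SSML element (hypothetical helper)."""
    return f"{{{SSML_NS}}}{tag}"


def build_ssml(lang="en-US"):
    """Build a small SSML document and return it as a string (illustrative only)."""
    # Serialize SSML elements with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", SSML_NS)

    # speak: the required root element, with version and xml:lang.
    speak = ET.Element(q("speak"), {"version": "1.0", f"{{{XML_NS}}}lang": lang})

    # voice: request a voice by attribute (assumes a matching voice is installed).
    voice = ET.SubElement(speak, q("voice"), {"gender": "female"})

    # s + sub: a sentence in which "W3C" is spoken as its alias.
    first = ET.SubElement(voice, q("s"))
    first.text = "SSML is defined by the "
    sub = ET.SubElement(first, q("sub"), {"alias": "World Wide Web Consortium"})
    sub.text = "W3C"
    sub.tail = "."

    # s + say-as: mark the contained text as a date (value support is engine-specific).
    second = ET.SubElement(voice, q("s"))
    second.text = "The review is on "
    date = ET.SubElement(second, q("say-as"), {"interpret-as": "date", "format": "mdy"})
    date.text = "10/19/2016"
    date.tail = "."

    # break: an empty element that inserts a pause.
    ET.SubElement(voice, q("break"), {"time": "500ms"})

    # prosody: adjust rate and volume for the contained text.
    prosody = ET.SubElement(voice, q("prosody"), {"rate": "slow", "volume": "loud"})
    prosody.text = "Thank you for listening."

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml())
```

The resulting string can then be passed to whatever synthesis API accepts SSML input. Building the markup with an XML library rather than string concatenation keeps the document well-formed and escapes special characters in the spoken text.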