Note

Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.

Microsoft Speech Platform

About Lexicons and Phonetic Alphabets

Lexicons contain the mapping between the written representations and the pronunciations of words or short phrases. Speech recognition engines and speech synthesis engines each have an internal lexicon that specifies which words in a language can be recognized or spoken. The lexicon specifies how the engine expects a word to be pronounced using characters from a single phonetic alphabet.

A phonetic alphabet contains combinations of letters, numbers, and characters that are known as "phones". Phones describe the spoken sounds of one or more human languages, and they are the valid set of tokens for defining the pronunciations of words in phonetic spellings. Like the phonetic spellings in dictionaries, phonetic spellings in lexicons describe how words should be pronounced for speech recognition or for speech synthesis.

For example, here is a phonetic spelling that a speech engine can use to recognize or speak the word "hello" in US English:

H EH . S1 L O

This example uses phones from the Universal Phone Set (UPS). The letters indicate the spoken segments that make up the word. The "S1" is a stress symbol that indicates which syllable should be accented. The phones are space-delimited, and the "." separates the syllables in the word.

Note: The Speech Platform supports three phonetic alphabets. See Phonetic Alphabet Reference (Microsoft.Speech) for descriptions and lexicon Element PLS (Microsoft.Speech) for how to specify a phonetic alphabet.
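As a minimal sketch, the "hello" spelling above could be placed in a lexicon document like the one below. The alphabet value "x-microsoft-ups" is assumed here to be the identifier that selects the Universal Phone Set; check the lexicon Element reference for the exact values the Speech Platform accepts.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- alphabet="x-microsoft-ups" is an assumed identifier for the UPS alphabet -->
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      alphabet="x-microsoft-ups" xml:lang="en-US">

  <lexeme>
    <!-- Written form of the word -->
    <grapheme> hello </grapheme>
    <!-- Space-delimited phones; "." separates syllables, S1 marks the stressed syllable -->
    <phoneme> H EH . S1 L O </phoneme>
  </lexeme>

</lexicon>
```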

A speech recognition engine listens for pronunciations of words that correspond to phonetic spellings that are specified in its internal lexicon. A speech synthesis engine pronounces words as specified by the phonetic spellings in its internal lexicon. A speech engine can also create pronunciations on the fly for words it encounters that are not included in its lexicon. A speech engine includes an acoustic model and a language model. The acoustic model describes the sounds of a language, while the language model describes how words are distributed in spoken language. The speech engine's language model and acoustic model enable it to process spoken variations of the pronunciations specified in its lexicon, as well as new words.

Application lexicons

You can supplement a speech engine's default lexicon by creating an application-specific lexicon to improve the accuracy of speech recognition or to customize the vocabulary and pronunciations of a synthesized voice. This is often not necessary because a speech engine can find and create pronunciations for both common and uncommon words in a language.

However, if your application includes words that feature unusual spelling or atypical pronunciation of familiar spellings, then the speech engine may not create the pronunciation that works best for your application. In these cases, you can specify a custom pronunciation that may improve the recognition accuracy for the specialized vocabulary in your application. For example, you can create an application lexicon that does the following:

  • Contains words that are not included in the default lexicon and specifies their pronunciations. For example, you can add proper nouns, such as place names and business names, or words that are specific to specialized areas of business, education, or medicine.

  • Specifies new pronunciations that replace the pronunciations defined for words that are already included in the default lexicon. For example, you can add pronunciations that capture regional variations of languages, dialects, slang, and common mispronunciations. You can specify multiple phonetic spellings (pronunciations) for a word. The speech recognition engine accepts all pronunciations of a word in a lexicon; the speech synthesis engine chooses and speaks only one of the pronunciations for a word in a lexicon. See the prefer attribute of the phoneme Element PLS (Microsoft.Speech).

The words and pronunciations that you create in application-specific lexicons take precedence over the pronunciations in a speech engine's default lexicon. Pronunciations specified inline in speech recognition grammar documents and in speech synthesis prompt documents take precedence over pronunciations in both application-specific lexicons and default lexicons. See Inline Pronunciations later in this topic.

Multiple lexicons may be active simultaneously for an application. For example, if an application references multiple grammars that each reference a different lexicon, then the application has access to all the words and pronunciations contained in all the referenced lexicons in addition to the speech engine's default lexicon.

In the Speech Platform, each grammar document can reference only one lexicon. This is a departure from the Speech Recognition Grammar Specification (SRGS) Version 1.0. See SRGS Grammar XML Reference (Microsoft.Speech).

Authoring lexicons

You author lexicons as XML documents that follow the format of the Pronunciation Lexicon Specification (PLS) Version 1.0. Lexicon documents must declare the single language-culture of the words in the lexicon and the phonetic alphabet used to construct the pronunciations. See Pronunciation Lexicon Reference (Microsoft.Speech) for a summary reference of the PLS specification and for information that is specific to Microsoft's implementation of PLS.

Here is an example of a short lexicon that contains two spellings of a word and two pronunciations, one of which is designated as preferred. Note the declaration of the phonetic alphabet used to create the pronunciations, given in the alphabet attribute of the lexicon Element PLS (Microsoft.Speech).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="x-microsoft-sapi" xml:lang="en-US">

  <lexeme>
    <!-- Two spellings of the same word -->
    <grapheme> theater </grapheme>
    <grapheme> theatre </grapheme>
    <!-- Two pronunciations; the first is marked as preferred -->
    <phoneme prefer="true"> 1 th iy . ax . t ax r </phoneme>
    <phoneme> th iy . 1 eh . t ax r </phoneme>
  </lexeme>

</lexicon>
```

Specifying lexicons

After you have authored a lexicon, you must reference it from a speech recognition grammar document or a speech synthesis prompt document to use its words and pronunciations. An application-specific lexicon is only active while the grammar or prompt document that references it is active.
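The following sketch shows one way a grammar document might reference a lexicon, using the lexicon element defined by SRGS 1.0. The URI, the rule name "venue", and the rule contents are hypothetical and chosen only for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="venue"
      xmlns="http://www.w3.org/2001/06/grammar">

  <!-- Hypothetical location of a lexicon document; it must appear before the rules -->
  <lexicon uri="http://www.example.com/lexicons/theater.pls"/>

  <!-- Words matched by this rule can use the pronunciations from the lexicon above -->
  <rule id="venue" scope="public">
    <one-of>
      <item> theater </item>
      <item> theatre </item>
    </one-of>
  </rule>

</grammar>
```

The pronunciations in the referenced lexicon are available only while this grammar is active.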

Lexicons are reusable. For example, you can reference a lexicon that contains specialized medical terms from multiple grammars or prompts in an application, or from multiple applications. You can use the same lexicon for both speech recognition and speech synthesis.
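As an illustration of reuse for synthesis, a prompt document might reference a lexicon with the SSML 1.0 lexicon element, as sketched below. The URI is hypothetical; the same lexicon document could also be referenced from a grammar.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
      xmlns="http://www.w3.org/2001/10/synthesis">

  <!-- Hypothetical location of a lexicon document; it must precede the text to be spoken -->
  <lexicon uri="http://www.example.com/lexicons/theater.pls"/>

  The theater opens at seven.

</speak>
```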

Inline pronunciations

In addition to creating custom pronunciations in lexicon documents, you can also specify custom pronunciations inline in speech recognition grammar documents and in speech synthesis prompt documents. Again, speech engines are quite adept at creating pronunciations for uncommon words, so be sure to test your custom pronunciations to verify that they improve the speech experience for users of your application.

Inline pronunciations apply only to the single instance of the element in which they are specified.
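A minimal sketch of an inline pronunciation in a prompt document follows, using the SSML 1.0 phoneme element. The alphabet value "x-microsoft-ups" is an assumption about the identifier for the Universal Phone Set, and the sentence is illustrative only. Grammar documents use a different mechanism for inline pronunciations, which is not shown here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xml:lang="en-US"
      xmlns="http://www.w3.org/2001/10/synthesis">

  <!-- The custom pronunciation applies only to the word inside this phoneme element -->
  Say <phoneme alphabet="x-microsoft-ups" ph="H EH . S1 L O"> hello </phoneme> first.

  <!-- This second instance has no inline pronunciation, so the engine uses its lexicons -->
  Then say hello again.

</speak>
```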