Introducing Computer Speech Technology

In the mid to late 1990s, personal computers became powerful enough for users to speak to them and for the computers to speak back. While speech technology is still far from supporting natural, unstructured conversations with computers, it is already delivering real benefits in practical applications. For example:

  • Many large companies have started adding speech recognition to their Interactive Voice Response (IVR) systems. Just by phoning a number and speaking, users can buy and sell stocks from a brokerage firm, check flight information with an airline company, or order goods from a retail store. The systems respond using a combination of prerecorded prompts and an artificially generated voice.
  • Microsoft Office XP (Office XP) users in the United States, Japan, and China can dictate text to Microsoft Word or PowerPoint documents. Users can also dictate commands and manipulate menus by speaking. For many users, particularly speakers of Japanese and Chinese, dictating is far quicker and easier than using a keyboard. Office XP can speak back too. For example, Microsoft Excel can read text back to the user as the user enters it into cells, saving the trouble of checking back and forth from screen to paper.

The two key underlying technologies behind speech-enabling computer applications are speech recognition (SR) and speech synthesis. These technologies are introduced in the following sections.

  • Introduction to Computer Speech Recognition
  • Introduction to Computer Speech Synthesis

Introduction to Computer Speech Recognition

Speech recognition (SR) is the process of converting spoken language into text. Speech recognition, also called speech-to-text recognition, involves:

  1. Capturing and digitizing the sound waves produced by a human speaker.
  2. Converting the digitized sound waves into the basic units of language sound, called phonemes.
  3. Constructing words from the phonemes.
  4. Analyzing the context in which the words appear to ensure correct spelling for words that sound alike (such as write and right).

The figure below illustrates a general overview of the process.

Recognizers (also known as speech recognition engines) are the software components that convert the acoustic signal into a digital signal and deliver recognized speech as text to an application. Most recognizers support continuous speech recognition, meaning that users can speak naturally into a microphone at a conversational pace. Isolated (or discrete) speech recognizers, which require the user to pause after each word, are being replaced by continuous speech engines.

Continuous speech recognition engines currently support two modes of speech recognition:

  • Dictation, in which the user enters data by speaking directly to the computer.
  • Command and control, in which the user initiates actions by speaking commands or asking questions.

Using dictation mode, users can dictate memos, letters, and e-mail messages, as well as enter data. The size of the recognizer's grammar limits the possibilities of what can be recognized. Most recognizers that support dictation mode are speaker-dependent, meaning that accuracy varies depending on the user's speaking patterns and accent. To ensure the most accurate recognition, the application must create or access a speaker profile that contains information about the user's speech patterns.

Using command and control mode, users can speak commands that control the functions of an application. Implementing command and control mode is the easiest way for developers to integrate a speech interface into an existing application, because developers can limit the content of the recognition grammar to the available commands (a minimal example grammar appears after this list). This limitation has several advantages:

  • It produces better accuracy and recognition rates than dictation, because a grammar used for dictation must encompass nearly an entire language's vocabulary.
  • It reduces the processing overhead that the application requires.
  • It also enables speaker-independent processing, eliminating the need for speaker profiles or "training" of the recognizer.
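
To illustrate how small a command and control grammar can be, the following sketch defines a single rule that accepts three spoken commands. It uses the same grammar format as the vehicle-trading example later in this topic; the rule name and command phrases are illustrative only.

<grammar root="ruleCommands" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">

    <rule id="ruleCommands" scope="public">
        <one-of>
            <item>  check balance  <tag> $._value = "BALANCE" </tag> </item>
            <item>  transfer funds  <tag> $._value = "TRANSFER" </tag> </item>
            <item>  speak to an agent  <tag> $._value = "AGENT" </tag> </item>
        </one-of>
    </rule>

</grammar>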

Speech Recognition Using the Microsoft Speech Application Platform

Speech recognition using the Microsoft Speech Application Platform (Speech Platform) is a process with two distinct phases.

The first phase involves recording the user's speech and delivering the audio recording and an application grammar to an SR engine, which converts the speech in the audio recording to text as described earlier. The route through which the audio and grammar reach the speech recognition engine differs slightly depending on the client that calls the speech application.

In a Telephony Scenario and in a Windows Mobile-based Pocket PC 2003 (Pocket PC) Multimodal Scenario, the audio and grammar are sent to the Speech Engine Services (SES) component of Microsoft Speech Server (MSS). SES loads the application grammar, and then sends the audio and grammar to the Speech API (SAPI). SAPI parses the grammar into the appropriate rules, properties, and phrases, and passes the parsed grammar and audio to an available SR engine, which performs the actual recognition work.

In a Desktop Multimodal Scenario, the Speech Add-in for Microsoft Internet Explorer (Speech Add-in) loads the grammar, and then instantiates a shared SAPI SR engine. The Speech Add-in passes the grammar and audio to SAPI, which parses the grammar and passes the parsed grammar and audio to the SR engine. The SR engine then performs the recognition.

The second phase involves semantic analysis of the resulting recognition text in order to determine its meaning. The recognizer iteratively compares the recognition text to the rules in the application's grammar. When it matches recognized text to a series of rules in the grammar, the recognizer produces an XML output stream in Semantic Markup Language (SML) to represent the semantic output. The semantic output contains recognition confidence values and the recognized text, and can also contain semantic values that the developer assigns using semantic interpretation markup. Developers use the information in the SML output to infer the meaning of what the user said.
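
For illustration only, the following hypothetical SALT fragment sketches how a speech-enabled page might tie these two phases together: it points the recognizer at an application grammar and binds one semantic value from the SML result to a page element. The grammar file name, element IDs, and SML path are assumptions, not part of the platform description above.

<salt:listen id="listenVehicle">
    <!-- Reference to the application grammar used for this recognition. -->
    <salt:grammar src="./Grammars/VehicleTrade.grxml" />
    <!-- Copy the Vehicle semantic value from the SML result into a page element. -->
    <salt:bind targetelement="txtVehicle" value="/SML/Vehicle" />
</salt:listen>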

A Simple Example Grammar

The following grammar is a simple English-language grammar for vehicle trading. The root rule, called ruleVTrade, defines the structure of the sentence or phrase that can be recognized using this grammar. It defines optional words and phrases like "I want" and "please," and references two other grammar rules. The first referenced rule is called ruleAction, and it accepts words like "buy" and "sell." The second rule, ruleVehicle, accepts the words "car," "auto" and "truck."

<grammar root="ruleVTrade" version="1.0" xmlns="http://www.w3.org/2001/06/grammar"
 xml:lang="en-US" tag-format="semantics-ms/1.0">
 
    <rule id="ruleVTrade" scope="public">
        <item repeat="0-1">  I want to  </item>
        <item repeat="0-1">  I would like to  </item>
        <item repeat="0-1">
            <ruleref uri="#ruleAction"/> <tag> $.Action = $$ </tag>
        </item>
        <item repeat="0-1"> a  </item>
        <item repeat="0-1"> an  </item>
        <item repeat="0-1">
            <ruleref uri="#ruleVehicle" /> <tag> $.Vehicle = $$ </tag>
        </item>
        <item repeat="0-1">  please  </item>
    </rule>
    
    <rule id="ruleAction">
        <one-of>
            <item>  buy  <tag> $._value = "BUY" </tag> </item>
            <item>  sell  <tag> $._value = "SELL" </tag> </item>
        </one-of>
    </rule>
    
    <rule id="ruleVehicle">
        <one-of>
            <item>  car  <tag> $._value = "AUTO" </tag> </item>
            <item>  auto  <tag> $._value = "AUTO" </tag> </item>
            <item>  truck  <tag> $._value = "TRUCK" </tag> </item>
        </one-of>
    </rule>
    
</grammar>

SML Recognition Results

The following code blocks illustrate the results of successfully recognizing the phrases "I want to buy a truck" and "Sell a car" using the preceding grammar. The recognizer uses the information generated by the root rule to create the top-level SML node. Each rule that the root rule references, and that the recognizer matches to the user's speech, produces an XML node in the SML output.

<SML confidence="0.769" text="I want to buy a truck" utteranceConfidence="0.769">
    <Action confidence="0.873">BUY</Action> 
    <Vehicle confidence="0.876">TRUCK</Vehicle> 
</SML>

<SML confidence="0.802" text="sell a car" utteranceConfidence="0.802">
    <Action confidence="0.872">SELL</Action> 
    <Vehicle confidence="0.889">AUTO</Vehicle> 
</SML>

Note  In the preceding grammar, the ruleVehicle rule returns the semantic value AUTO when the user says either auto or car.

Valuable Features that Speech Recognition Adds to Applications

Using speech recognition technology, developers can produce applications that feature:

  • Hands-free computing, either as an alternative to the keyboard or as a way to use an application in environments where a keyboard is impractical (such as on small mobile devices, AutoPCs, or mobile phones).
  • A more human-like computer interface, making educational and entertainment applications seem more friendly and realistic.
  • Voice responses to message boxes and wizard screens.
  • Streamlined access to application controls and large lists, enabling a user to speak any item from a list, or any command from a large set of commands without navigating through several dialog boxes or cascading menus.
  • Context-sensitive dialogues between the user and the computer in which the computer's response depends on the user's input. For example, imagine a travel services application asking a user, "What do you want to do?" and the user replying, "I want to book a flight from New York to Boston." In this case, the application could respond by asking whether the user wants to book a flight departing from La Guardia airport, or JFK airport. After receiving this information, the application could continue by asking what day and time the user wants to leave. If the user wants to book a flight departing from a city with only one airport, the application would not need to clarify the departure airport, and could move immediately to asking about a departure day and time.

Potential Applications for Speech Recognition

The specific uses of speech recognition technology depend on the application type. Some of the types of applications that are good candidates for implementing speech recognition include:

  • Telephony
    Speech recognition plays a critical role in telephony applications. Many telephony applications require users to press telephone keypad numbers in order to select from a list of items. Pressing keypad numbers can be inconvenient for users with cell telephones or handset telephones. Speech recognition can make placing orders, selecting items, and providing other information over the phone more natural.
  • Data Entry
    Speech recognition can significantly improve the speed of entering numbers and items selected from a small list (fewer than 100 items). Some recognizers can even handle spelling fairly well. If a spreadsheet or database application has fields with mutually exclusive data types (for example, one field that recognizes male or female, another that recognizes age, and a third that recognizes a city name), developers can program the application to populate the correct fields automatically when the user provides all of the requested pieces of information in a single utterance (a hypothetical result appears after this list).
  • Games and Edutainment
    Speech recognition enhances the realism and fun in many computer game and edutainment applications by enabling users to talk to on-screen characters as if they were talking to another person.
  • Document Editing
    In dictation mode, speech recognition can enable users to dictate entire documents into a word processor without typing. In command and control mode, speech recognition can enable users to modify document formatting or change views without using the mouse or keyboard. For example, a word processing application can provide commands like "bold," "italic," "change to Times New Roman font," "use bullet list text style," and "use 18 point type," while a graphic manipulation application can provide commands like "select eraser" or "choose a wider brush."
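
To illustrate the single-utterance data entry scenario mentioned under Data Entry above, the following sketch shows the kind of SML result such an application might receive for the utterance "thirty four year old male from Seattle," assuming a hypothetical grammar that defines Age, Gender, and City rules. The rule names and confidence values are illustrative.

<SML confidence="0.810" text="thirty four year old male from Seattle" utteranceConfidence="0.810">
    <Age confidence="0.842">34</Age>
    <Gender confidence="0.867">MALE</Gender>
    <City confidence="0.851">SEATTLE</City>
</SML>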

Introduction to Computer Speech Synthesis

Speech synthesis is the process of generating sound that resembles spoken human language. Two components of the Microsoft Speech Platform are responsible for producing spoken language output:

  • The Speech Platform Prompt Engine is a voice response component that generates output by concatenating pre-recorded words and phrases from a prompt database.
  • The Speech Platform TTS Engine is a voice synthesis component that generates output by synthesizing words and phrases.

The Speech Platform Prompt Engine

The prompt engine is the component of the Speech Platform that takes text input and produces speech output by concatenating recordings of words and phrases that match the text input. The prompt engine stores the recordings it uses on disk and indexes them in one or more prompt database files.

The speech application's code creates the requests that convert text input into speech output. A prompt engine request normally includes Prompt Engine Markup Language (PEML) markup to specify the database or databases that the prompt engine should use, and the text input for which the prompt engine should produce speech output. The prompt engine can search multiple prompt databases simultaneously. Databases are stored on disk, and consist of an index of available word and phrase segments and the associated audio data.
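
The following is only a rough sketch of what such a request might look like; the element names, namespace, and database file name are assumptions made for illustration, and the Prompt Engine Markup Language topic is the authoritative reference for the actual syntax.

<peml:prompt_output xmlns:peml="http://schemas.microsoft.com/Speech/2003/03/PromptEngine">
    <!-- Hypothetical prompt database containing the prerecorded segments. -->
    <peml:database fname="FlightPrompts.promptdb" />
    Your flight departs from New York at ten thirty A M.
</peml:prompt_output>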

When text is sent to the prompt engine, the prompt engine searches the specified databases for prerecorded segments that match the text. Note that what the prompt engine actually searches are the indices of the databases loaded into memory, and that the audio data itself is read from disk as needed. See Prompt Engine Markup Language for more information on prompt engine database searches.

If the Speech Platform prompt engine is unable to construct a segment from the set of recorded words and phrases in the prompt database, it instantiates a "fallback" text-to-speech (TTS) engine to synthesize the output instead.

The Speech Platform TTS Engine

The TTS engine (also referred to as a text-to-speech voice) is the component that synthesizes speech output by:

  1. Breaking down the words of the text into phonemes.
  2. Analyzing the input for text that requires special handling before it can be spoken, such as numbers, currency amounts, and punctuation, and expanding it into words (a process known as text normalization, or TN). For example, "$100" is expanded to "one hundred dollars."
  3. Generating the digital audio for playback.

The following figure illustrates a general overview of the process.

TTS engines in general can use one of two techniques:

  • Formant TTS
  • Concatenative TTS

Using linguistic rules and models, a formant TTS engine generates artificial sounds similar to those created by human vocal cords, and applies various filters to simulate throat length, mouth cavity shape, lip shape, and tongue position. Although formant TTS engines can produce highly intelligible speech output, the output still has a "computer accent."

A concatenative TTS engine also uses linguistic rules to formulate output, but instead of generating artificial sounds, the engine produces output by concatenating recordings of units of real human speech. These units are usually phonemes or syllables that have been extracted from larger units of recorded speech, but may also include words or phrases.

The Speech Platform TTS engine uses the concatenative technique. Although speech output produced by the concatenative technique sounds more natural than speech output produced by the formant technique, it still tends to sound less human than a continuous recording of the same output made by a human speaker. Nevertheless, text-to-speech synthesis can be the better alternative when preparing an individual audio recording of every prompt that an application requires is impractical.

Generally, developers consider using TTS when:

  • Audio recordings are too large to store on disk, or are prohibitively expensive to record.
  • The developer cannot predict what responses users will require from the application, such as requests to read e-mail over the telephone (see the sketch after this list).
  • The number of alternate responses required makes recording and storing prompts unmanageable or prohibitively expensive.
  • The user prefers or requires audible feedback or notification from the application. For example, a user may prefer to use TTS to perform audible proofreading of text and numbers in order to catch typographical errors missed by visual proofreading.
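
As a rough sketch of the e-mail scenario above, a SALT prompt element can pass text that is known only at run time to the TTS engine. The element IDs and the source of the message text are hypothetical.

<salt:prompt id="promptReadMail">
    You have a new message.
    <!-- The message subject is retrieved at run time, so it cannot be prerecorded. -->
    <salt:value targetelement="txtSubject" targetattribute="value" />
</salt:prompt>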

Potential Applications for Text-to-Speech Synthesis

The specific uses of TTS technology depend on the application type. Some of the applications that are good candidates for implementing TTS include:

  • Telephony
    Text-to-speech synthesis plays a critical role in telephony applications. Because telephony applications have no visual interface, using TTS is a valuable method for verifying customer selections. TTS can also be the preferred method for delivering information that users request, especially when the range of the information requested derives from a large set of possible values. Stock price information is a good example. Although it is possible to record all of the numbers that are likely to be part of a stock price, doing so is time-consuming and expensive, making TTS a faster, less expensive alternative.
  • Data Entry
    Developers can have an application read data values back as users enter them in a spreadsheet or database, verifying the correct entry of information that is tedious to check visually, such as phone numbers, addresses, monetary values, and identification numbers.
  • Games and Edutainment
    Text-to-speech synthesis allows the characters in an application to talk to the user instead of merely displaying speech balloons. Even if it is possible to use digital recordings of speech, using TTS instead of recordings can be preferable in certain cases. For example, the less-than-human quality of artificially synthesized speech makes it ideal for characters whose voices are robotic or alien.

In addition, TTS can be generally useful for application prototyping. In some cases, TTS may even be the only practical option. For example, if the development schedule does not allow enough time to record all of the prompts that an application requires, the developer may have no alternative to using TTS.

See Also

Speech Application Platform Glossary