The Speech Application SDK and the Speech Application Platform

The Microsoft Speech Application Platform provides development and deployment environments for speech-enabled Web applications. The Speech Platform consists of two major components:

  • Microsoft Speech Application SDK Version 1.1 (SASDK)
  • Microsoft Speech Server 2004 (MSS)

Microsoft Speech Application SDK

The SASDK is the component of the Speech Platform that enables developers to create and debug multimodal and voice-only applications. The SASDK includes:

  • A set of ASP.NET controls (Speech Controls) that enable speech input and output in ASP.NET applications by generating HTML + Speech Application Language Tags (SALT) markup for voice-only (for example, telephony) and multimodal browsers.

  • A suite of Visual Studio .NET 2003 add-on tools that developers use to speech-enable an application with little speech-specific knowledge:

    • Speech Control Editor for creating and editing Speech Controls
    • Speech Grammar Editor for creating and editing speech recognition grammars
    • Speech Prompt Editor for creating and editing prerecorded voice output

  • Speech debugging tools, including Telephony Application Simulator, Speech Debugging Console, and Speech Debugging Console Log Player

  • Speech Add-in for Microsoft Internet Explorer that allows developers to run and test their speech-enabled Web applications.

  • A set of sample applications that demonstrate specific tasks and components, and reference applications that demonstrate full application implementation.

  • A rich grammar library.

  • A step-by-step tutorial on building a sample application using Speech Controls, grammars, and prompts.

Desktop applications, both multimodal and voice-only, require a workstation running Microsoft Windows XP or Windows Server 2003 with Internet Explorer 6.0 and the Microsoft Speech Add-in for Microsoft Internet Explorer. Telephony applications can be accessed using Telephony Application Simulator, which enables developers to simulate running the application in the MSS environment.

An ASP.NET Web server runs the speech-enabled Web applications. Speech recognition, prompting, and synthesis are performed by various components, depending on the type of client that runs the speech-enabled application. Applications running on desktop computers use locally installed speech engines that comply with the Speech Application Programming Interface (SAPI) version 5.x.

Microsoft Speech Server

MSS is the server-based infrastructure that deploys and runs distributed speech-enabled Web applications. MSS provides scalable, secure, and manageable speech-processing services that run on the Microsoft Windows Server 2003 family of operating systems. MSS also enables deployment of telephony applications as well as multimodal applications for Windows Mobile-based Pocket PC 2003 (Pocket PC) devices. MSS includes:

  • Telephony Application Services, which is a telephony SALT interpreter.
  • Speech Engine Services (SES) that provides speech recognition and speech output resources.
  • Telephony Interface Manager software that enables the installed telephony board to communicate with MSS.
  • Speech Add-in for Microsoft Pocket Internet Explorer that allows developers to run and test their speech-enabled applications on Pocket PC.

Pocket PC applications, both multimodal and voice-only, require a Pocket PC device running Windows Mobile 2003 with Pocket Internet Explorer and the Microsoft Speech Add-in for Microsoft Pocket Internet Explorer. Applications running on Pocket PC devices or telephony clients use remote recognition, which is managed by SES.

Note  To run speech-enabled Web pages on Pocket Internet Explorer, the Pocket PC device must be connected to an 802.11b or faster network, and the speech-enabled Web application must be configured to process speech remotely on a server running SES.

For more information about obtaining MSS, see the Microsoft Speech Technologies Web site.

Roles of Speech Platform Constituents

The following list describes the major constituents of the SASDK component of the Speech Platform and summarizes what each does.

  • Speech Controls
    Speech Controls expose a higher-level API than SALT, abstracting a dialogue model and providing ASP.NET with a macro language for authoring SALT applications. On a Web page built with Speech Controls, the Web server translates each Speech Control tag into client-side SALT, as well as the necessary script to tie the application together. Multimodal browsers receive a full set of visual HTML plus additional SALT and script. Telephony SALT interpreters are sent a subset of HTML, including forms and input fields, plus SALT and script for managing voice-only dialogue flow.
  • Visual Studio .NET 2003 Authoring Tools
    The SASDK provides speech-enabled Web application authoring tools for use within the Visual Studio .NET 2003 development environment: Speech Grammar Editor, Speech Prompt Editor, and Speech Control Editor.
  • Speech Debugging Tools
    The Speech Debugging Tools include Telephony Application Simulator, Speech Debugging Console, and Speech Debugging Console Log Player. These tools enable developers to test and debug both voice-only and multimodal speech-enabled Web applications.
  • Speech Add-in for Microsoft Internet Explorer
    Microsoft Internet Explorer and the Speech Add-in render the GUI in multimodal applications, perform speech recognition, and exchange HTML, SALT and script with the Web server.

The following list describes the constituents of the MSS component of the Speech Platform and summarizes what each does.

  • Telephony Application Services (TAS)
    In a voice-only application, TAS interprets SALT-enabled pages from the Web server and acts as the client. TAS works with non-Microsoft hardware and software to connect to a telephone switch or public switched telephone network. TAS:
    • Manages the transmission of audio to Microsoft Speech Engine Services (SES) for recognition and takes actions based on the results
    • Manages the transmission of marked-up text to SES and of audio back to the caller
    • Allows callers to interact by pressing telephone keys
  • Speech Engine Services (SES)
    Speech Engine Services provides speech recognition and speech output resources, primarily for telephony and Pocket PC clients. It can also provide these services to desktop PC clients for debugging purposes. Speech output is provided by SES through text-to-speech (TTS), playback of prerecorded prompts, or a mixture of the two. SES receives speech and returns recognition results, and receives marked-up text and returns speech.
  • Telephony Interface Manager (TIM)
    The Telephony Interface Manager (TIM) software is tightly coupled to the installed telephony board, and enables the board to communicate with Microsoft Speech Server (MSS). All audio, whether from a caller or from SES, passes through the TIM software. The TIM software accepts incoming telephone calls and routes them to an instance of the SALT interpreter; the audio portion of each call is sent to SES.
  • Speech Add-in for Microsoft Pocket Internet Explorer
    The Speech Add-in for Microsoft Pocket Internet Explorer adds speech recognition and speech output capability to the browser but, in contrast to the Speech Add-in for Microsoft Internet Explorer, speech recognition and text-to-speech are performed on SES.

SALT

The Microsoft approach to speech-enabled Web applications is built around a newly emerging standard: SALT. Speech Application Language Tags are a lightweight set of extensions to existing markup languages, in particular HTML and XHTML. SALT enables multimodal and telephony access to information, applications and Web services from PCs, telephones, tablet PCs, and Pocket PCs.

SALT consists of a small set of XML elements, with associated attributes and Document Object Model (DOM) object properties, events and methods, that apply a speech interface to Web pages. The following example presumes a device that can be clicked with a mouse or tapped with a stylus, and a browser that supports events, simple objects and method calling.

Example of SALT Markup

The following example shows partial code that demonstrates binding parts of a recognition result to HTML page elements. The full code for this example is in the SALT bind Element Example document.

In the following code, the listen element contains a grammar element that references an external file containing a grammar of city names. At run time, the listen element creates a listen object that begins listening for words contained in the referenced grammar. The listen element also contains two bind elements that create bind objects at run time.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html xmlns:salt="http://www.saltforum.org/2002/SALT">
<head>
  <title>bind Element Example</title>
</head>
<body>
  <!-- declare the object on the page-->
  <object id="SpeechTags" CLASSID="clsid:33CBFC53-A7DE-491A-90f3-0E782A7E347A" VIEWASTEXT></object>
  <?import namespace="salt" implementation="#SpeechTags" />

  <salt:listen id="listen1" mode="multiple" onreco="Handleonreco()" onnoreco="Handleonnoreco()"  onsilence="Handleonsilence()" onerror="Handleonerror()">
    <salt:grammar id="gram1" src="./cities.grxml"/> 
    <salt:bind targetelement="boxFromCity" targetattribute="value" value="//origin_city"/>
    <salt:bind targetelement="boxToCity"  targetattribute="value" value="//destination_city"/>
  </salt:listen>
    
  <h3>bind Element Example</h3>
  
  ………
  ………

The values of the targetelement attributes in the bind elements name the HTML text box elements that display parts of the recognition result. The values of the value attributes specify XPaths that identify locations in the Semantic Markup Language (SML) output produced by the recognizer. These locations contain the semantic values that the bind objects copy into the specified HTML elements.
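
For illustration, the HTML elements named by the targetelement values might look like the following sketch. These two text boxes are hypothetical stand-ins for the markup elided from the example above; only their id values (boxFromCity and boxToCity) come from the bind elements.

  <!-- Hypothetical text boxes that receive the bound recognition values -->
  <input type="text" id="boxFromCity" />
  <input type="text" id="boxToCity" />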

At run time, as the recognizer produces a recognition result, semantic interpretation markup in the referenced grammar generates the semantic values associated with the departure and arrival city names. After the recognizer recognizes the spoken input and produces SML output, the run-time bind objects copy the semantic values from the locations identified by the XPaths and display them in the HTML text box elements specified by the bind elements.
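
As a rough illustration, and assuming a grammar whose semantic interpretation markup produces origin_city and destination_city values for an utterance such as "fly from Seattle to Boston", the SML document returned by the recognizer might resemble the following sketch. The element names simply mirror the XPaths used in the bind elements; the actual structure, attributes, and confidence values depend on the grammar and the recognition engine.

  <SML text="fly from Seattle to Boston" confidence="0.85">
    <origin_city confidence="0.90">Seattle</origin_city>
    <destination_city confidence="0.82">Boston</destination_city>
  </SML>

The XPath //origin_city selects the origin_city element, so its value (Seattle) is copied into the boxFromCity text box; //destination_city populates boxToCity in the same way.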

SALT can be used with HTML, XHTML and other standards to write speech interfaces for both voice-only (for example, telephony) and multimodal applications. For more information, see the SALT Forum Web site. The SALT Forum has produced version 1.0 of SALT and has contributed it to the Multimodal and Voice Browser Working Groups of the World Wide Web Consortium (W3C). In this documentation, SALT is used as the name for the Microsoft version of the SALT 1.0 specification.

ASP.NET Speech Controls

The SASDK contains a set of ASP.NET server controls called Speech Controls. Speech Controls enable developers to create Web applications that combine the speech capabilities of SALT with the power of ASP.NET.

ASP.NET renders each Speech Control into one or more SALT elements that are interpreted by a client SALT Interpreter. The client must run on one of these two general types of hardware platforms:

  • Tap-and-talk Devices: The client for this type of device runs in a browser that has a graphical user interface (GUI). Authors can create applications in which the user can see and confirm recognition results, and can click buttons and hyperlinks in order to manage application flow or request help.
  • Telephony Devices: The client for this type of device has no GUI. The user interacts with a Web page that is not seen. Applications must speak recognition results to the user for confirmation, and the user must speak or press the telephone keypad in order to manage application flow or request help.

There are four groups of Speech Controls:

  • Basic Speech Controls
    These controls are designed primarily for tap-and-talk Web pages, in which the user confirms recognition results and manages application flow through the GUI. They are server-side representations of the two basic SALT tags, <prompt> and <listen>.
  • Dialog Speech Controls
    These controls are designed for telephony Web pages. Members include controls for the collection and validation of data. Client-side script handles confirmation and application flow without a GUI. A minimal sketch of one of these controls appears after this list.
  • Application Speech Controls
    Application Speech Controls are composite controls, composed of Dialog Speech Controls, recognition grammars and prompts, and are designed for the collection of specialized types of commonly used information, such as dates, currency, and credit card numbers.
  • Call Management Controls
    A special subset of Dialog Speech Controls that support Computer-Supported Telecommunications Applications (CSTA) services. These controls are designed to answer, transfer, initiate, and disconnect telephone calls, as well as to gather call information and send and receive CSTA events. The SASDK also includes a SmexMessage (Simple Messaging Extension) control that is designed to send and receive raw CSTA messages.
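
To make the relationship between Speech Controls and the SALT they generate concrete, the following is a minimal sketch of how a Dialog Speech Control might be declared on an .aspx page. It assumes that the speech tag prefix has been registered for the SASDK controls, and the element and attribute names shown (QA, Prompt, Reco, Grammars, Grammar, Answers, Answer, XpathTrigger, SemanticItem) are illustrative assumptions rather than verbatim SASDK syntax; consult the SASDK documentation for the exact control schema. At run time, ASP.NET renders a control like this as client-side SALT prompt and listen elements, plus the script that ties them together, similar to the markup shown in the SALT example earlier.

  <!-- Hypothetical question-and-answer control that asks for a departure city -->
  <speech:QA id="qaFromCity" runat="server">
    <Prompt InlinePrompt="Which city are you leaving from?" />
    <Reco>
      <Grammars>
        <!-- The same kind of city grammar used in the SALT example -->
        <speech:Grammar Src="cities.grxml" runat="server" />
      </Grammars>
    </Reco>
    <Answers>
      <!-- Copies the value selected by //origin_city from the SML result
           into a semantic item (assumed to be declared elsewhere on the page) -->
      <speech:Answer SemanticItem="siFromCity" XpathTrigger="//origin_city" runat="server" />
    </Answers>
  </speech:QA>

The semantic item referenced here would typically hold the recognized value so that client-side dialogue flow script, or other controls, can confirm or reuse it.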