Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Microsoft Speech Platform
Grammar Rules and State Graphs
Grammar rules are elements that a speech recognition (SR) engine uses to restrict the possible word or sentence choices during speech recognition. SR engines employ grammar rules to control the elements of sentence construction using a predetermined list of recognized word or phrase choices. This list of recognized words or phrase choices contained in the grammar rules forms the basis of the SR engine vocabulary.
Each grammar rule element that a spoken phrase or sentence matches determines the recognition path. For example, examine the phrase describing travel plans, "I would like to drive from Seattle to New York," and note that certain elements determine the resulting information. In this example, a person is planning to drive from Seattle to New York. This is a very simple illustration of what could be a very complex problem. Determining the same travel plans without limiting the method, direction, and destination of travel would leave an effectively unlimited number of travel options.
The resulting information can be determined by restricting the available choices for a given sentence. Using this method, the resulting information can be composed only from certain choices, thus eliminating the possibility of an infinite number of travel plan combinations. The diagram in Figure 1 illustrates the choices in a sample phrase for a travel grammar:
Figure 1: Travel Grammar Choices
The elements of interest in the example phrase are as follows:
- Method of travel (fly or drive), specifically "drive"
- Travel direction (from or to), specifically "from"
- The city of origin for the travel plan (from), specifically "Seattle"
- Travel direction complement (from or to), specifically "to"
- The city of destination for the travel plan (to), specifically "New York"
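As a rough illustration of how restricting each element keeps the set of phrases finite, the following Python sketch enumerates every phrase a small travel grammar would allow. The city list and the names METHODS, CITIES, and travel_phrases are hypothetical and do not come from the Speech Platform. For brevity, the sketch covers only the "from ... to" ordering.

```python
from itertools import product

# Choice lists for each element of interest. The city list is a
# hypothetical stand-in for whatever the grammar author defines.
METHODS = ["fly", "drive"]
CITIES = ["Seattle", "New York", "Boston"]

def travel_phrases():
    """Yield every phrase this restricted grammar allows."""
    for method, origin, destination in product(METHODS, CITIES, CITIES):
        yield f"I would like to {method} from {origin} to {destination}"

phrases = list(travel_phrases())
print(len(phrases))  # 2 methods * 3 origins * 3 destinations = 18
print("I would like to drive from Seattle to New York" in phrases)  # True
```

Adding a city to the list enlarges the set of recognizable phrases, but the set stays finite and fully enumerable.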
The information can also be displayed as a graph of states and arcs, where each arc can have text (or semantic tags/properties) attached. The valid phrases are the unique paths through the graph, starting at the root and ending at a terminal state. In the diagram in Figure 2, each state is labeled as a root node, an interim node, or null (for the terminal node). The spoken text is denoted by words surrounded by quotation marks. The semantic property names are denoted by bold, block quoted words.
Figure 2: Travel Grammar - States, Arcs, and Semantics
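The idea of states, arcs, and attached semantic properties can be sketched in a few lines of Python. This is a minimal illustration, not how an SR engine is implemented. The state names, the GRAPH layout, and the recognize function are assumptions made for the example, and the graph covers only the "from ... to" ordering. Each arc carries its spoken text and an optional semantic property name, and a phrase is valid only if it traces a path from the root to the terminal state.

```python
# Each state maps to a list of arcs: (spoken_text, semantic_name, next_state).
# semantic_name is None when the arc carries no semantic property.
# State names and the city list are illustrative assumptions.
CITIES = ["Seattle", "New York"]
GRAPH = {
    "root": [("I would like to", None, "method")],
    "method": [("fly", "METHOD", "direction_1"),
               ("drive", "METHOD", "direction_1")],
    "direction_1": [("from", "DIRECTION", "city_1")],
    "city_1": [(city, "CITY_1", "direction_2") for city in CITIES],
    "direction_2": [("to", "DIRECTION", "city_2")],
    # A next_state of None marks the terminal (null) state.
    "city_2": [(city, "CITY_2", None) for city in CITIES],
}

def recognize(phrase, state="root"):
    """Return (name, value) pairs for a valid phrase, or None if no path matches."""
    if state is None:
        # Terminal state: valid only if the whole phrase has been consumed.
        return [] if not phrase else None
    for text, name, next_state in GRAPH[state]:
        if phrase == text or phrase.startswith(text + " "):
            rest = phrase[len(text):].lstrip()
            tail = recognize(rest, next_state)
            if tail is not None:
                return ([(name, text)] if name else []) + tail
    return None

print(recognize("I would like to drive from Seattle to New York"))
print(recognize("I would like to travel from Seattle to New York"))  # None
```

The result is a list of pairs rather than a dictionary because the DIRECTION property appears twice along the path.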
Consider what happens when the user speaks the following phrase:
I would like to drive from Seattle to New York.
Grammar rules become concatenated phrase elements. These phrase elements are limited to the defined set of grammars. Restricting the input to a limited set of possibilities gives significantly more control over the resulting information. Otherwise, obtaining the travel plan information from the same sample phrase, "I would like to drive from Seattle to New York," would be considerably more ambiguous.
Without a defined set of choices, the complexity of parsing the same sentence increases exponentially. Imagine the number of possible combinations in a sentence that is not restricted to a finite list of choices. With every choice left unconstrained, the example phrase reduces to the following:
"I want to (unknown travel method) (unknown travel direction) (unknown city) (unknown travel direction) (unknown city)."
The amount of predictable information is significantly reduced without the ability to constrain the available choices within a sentence.
For the phrase "I would like to drive from Seattle to New York," the semantic structure (using name/value pairs) is:
[METHOD="drive"], [DIRECTION="from"], [CITY_1="Seattle"], [DIRECTION="to"], [CITY_2="New York"]
By parsing the semantic structure, the application can easily and accurately analyze the content of the original phrase, without parsing or analyzing individual words. The application developer can then write application logic to perform specific actions based on the previously mentioned semantic names, and specialize the action based on the values of each semantic property. The grammar author can add to or delete from the lists of words, without breaking the application logic.
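A minimal sketch of such application logic follows, assuming the semantic structure reaches the application as a list of (name, value) pairs. The plan_trip function and its handler table are hypothetical names, not part of any Speech Platform API. The code branches only on semantic names and values, never on the spoken words.

```python
def plan_trip(semantics):
    """Choose an action from semantic (name, value) pairs.

    Assumes CITY_1 is the origin and CITY_2 the destination, as in the
    "from ... to" example phrase.
    """
    # DIRECTION appears twice in the structure, so it is left out of the lookup.
    props = {name: value for name, value in semantics if name != "DIRECTION"}
    # Hypothetical actions keyed by the value of the METHOD property.
    handlers = {
        "drive": lambda origin, dest: f"Route by road: {origin} -> {dest}",
        "fly": lambda origin, dest: f"Search flights: {origin} -> {dest}",
    }
    handler = handlers[props["METHOD"]]
    return handler(props["CITY_1"], props["CITY_2"])

semantics = [
    ("METHOD", "drive"), ("DIRECTION", "from"), ("CITY_1", "Seattle"),
    ("DIRECTION", "to"), ("CITY_2", "New York"),
]
print(plan_trip(semantics))  # Route by road: Seattle -> New York
```

Because the logic depends only on property names and values, adding a city to the grammar requires no change to this code.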