Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Microsoft Speech Platform
Designing Grammar Rules
A speech recognition grammar is a container of language rules that define a set of constraints that a speech recognizer can use to perform recognition. Grammar authors can increase the flexibility and power of a grammar by adding semantic information and by thoughtful use of dynamic rules that can be modified at application runtime.
Adding semantics to a grammar effectively isolates the text of recognized speech from the business logic of your application. The semantics of a grammar rule translate all the speech input options that a grammar rule defines into a result that your application expects. For example, a rule in a simple yes/no grammar may recognize any of the words "yes", "yep", "yeah", or "yup", but will generate the semantic result "yes" to the application.
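The mapping from spoken variants to one semantic value can be sketched in portable C++17. This is only an illustration of the idea, not Speech Platform code: the function name `semanticValue` and the "no"/"nope" entries are assumptions added for the example, and in a real grammar the mapping lives in the rule's semantic tags.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the semantic tags of a yes/no rule: every spoken
// variant maps to the single value that the application logic expects.
std::optional<std::string> semanticValue(const std::string& recognizedText) {
    static const std::unordered_map<std::string, std::string> kYesNoSemantics = {
        {"yes", "yes"}, {"yep", "yes"}, {"yeah", "yes"}, {"yup", "yes"},
        {"no", "no"},   {"nope", "no"},  // Assumed entries, not from the text.
    };
    const auto it = kYesNoSemantics.find(recognizedText);
    if (it == kYesNoSemantics.end()) {
        return std::nullopt;  // The phrase is not covered by the rule.
    }
    return it->second;
}
```

The application then branches on the returned value only, so adding another spoken variant never touches the business logic.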
Dynamic rules can be modified at run time to optimize an application's response to specific recognition scenarios, for example a phone call from a particular area code.
The following section covers:
- Semantic Properties or Tags
- Separation of Dynamic and Static Content
- Use Dynamic Rules for Language Flexibility
- Retrieving Semantic Tags or Properties from Recognition Results
- Using Semantic Properties, Hypotheses, and "Property Pushing"
Semantic properties or tags
For example, the phrase "Schedule a meeting with Nancy Anderson" could be annotated as follows:
| Phrase element | Grammar element | Contents |
| --- | --- | --- |
| "schedule a meeting" | "request: meeting" | attribute and value |
| "with" | "participants:" | only attribute |
| "Nancy Anderson" | "<e-mail alias>" | value type |
Defining the different grammar element components could result in the following:
"Schedule a meeting" → request: meeting
"with Nancy Anderson" → participants: NanAnd
The example sentence "Schedule a meeting with Nancy Anderson" generates the following SRGS XML grammar:
```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar xml:lang="en-US" root="scheduleMeeting" tag-format="semantics/1.0"
         version="1.0" xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="scheduleMeeting">
    <item> Schedule a meeting <tag> out.request="meeting" </tag> </item>
    <item> with </item>
    <ruleref uri="#participants" />
    <tag> out.participants = rules.latest(); </tag>
  </rule>

  <rule id="participants">
    <one-of>
      <item> Nancy Anderson <tag> out="NanAnd" </tag> </item>
      <item> Alan Brewer <tag> out="abrewer" </tag> </item>
      <item> Oliver Lee <tag> out="olilee" </tag> </item>
      <item> April Reagan <tag> out="areagan" </tag> </item>
      <item> Cindy White <tag> out="cwhite" </tag> </item>
      <item> Ken Kwok <tag> out="kkwok" </tag> </item>
    </one-of>
  </rule>

</grammar>
```
The result of saying the example sentence "Schedule a meeting with Nancy Anderson" would be as follows:
request:meeting
participants:NanAnd
For more information about support in the Speech Platform for authoring grammars with semantics, see Semantic Interpretation Markup (Microsoft.Speech) and Semantic Markup Language Reference (Microsoft.Speech).
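A minimal sketch of how an application might consume that result, assuming the semantic key/value pairs have already been copied into an ordinary map. The `SemanticResult` alias and `describeAction` function are hypothetical names used only for this illustration.

```cpp
#include <map>
#include <string>

// Hypothetical container for the key/value pairs a recognition produces.
using SemanticResult = std::map<std::string, std::string>;

// Decide what to do from the semantic values alone; the spoken wording
// ("Schedule a meeting with Nancy Anderson") never reaches this function.
std::string describeAction(const SemanticResult& result) {
    const auto request = result.find("request");
    const auto participants = result.find("participants");
    if (request == result.end() || participants == result.end()) {
        return "unrecognized request";
    }
    if (request->second == "meeting") {
        return "schedule meeting with " + participants->second;
    }
    return "unsupported request: " + request->second;
}
```

Because the function sees only `request` and `participants`, rewording the grammar (for example, accepting "Set up a meeting with") requires no change here.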
Separation of dynamic and static content
Applications should separate dynamic rule content from static rule content so that dynamic rules are easy to modify at run time. For example, given the grammar above with its list of names, the application could create a separate rule, isolated in its own grammar, that contains only the names. That list, based on an address book or past user data, can then be updated at run time, while the static grammar contains only a rule reference to the dynamic content. When the application starts, it can load the static content quickly and avoid delaying the startup sequence. It can then load the dynamic content, which requires the Speech Platform to initialize the back-end grammar compiler.
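The split can be modeled in a few lines of portable C++. This is a conceptual sketch only: `DynamicRule` and `EmailGrammar` are hypothetical classes, not Speech Platform types, and the staged-then-committed behavior is an assumption chosen to mirror the idea that edits take effect only once they are committed.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of a dynamic rule: edits are staged and become visible
// to the static grammar only after commit().
class DynamicRule {
public:
    void clear() { staged_.clear(); }
    void addPhrase(std::string phrase) { staged_.push_back(std::move(phrase)); }
    void commit() { active_ = staged_; }
    const std::vector<std::string>& activePhrases() const { return active_; }

private:
    std::vector<std::string> staged_;
    std::vector<std::string> active_;
};

// Static content: a fixed prefix plus a reference to the dynamic rule.
class EmailGrammar {
public:
    explicit EmailGrammar(const DynamicRule& addressBook)
        : addressBook_(addressBook) {}

    // True if the utterance is the static prefix followed by a phrase that
    // is currently committed to the dynamic rule.
    bool accepts(const std::string& utterance) const {
        static const std::string kPrefix = "send new e-mail to ";
        if (utterance.compare(0, kPrefix.size(), kPrefix) != 0) {
            return false;
        }
        const std::string name = utterance.substr(kPrefix.size());
        for (const auto& phrase : addressBook_.activePhrases()) {
            if (phrase == name) return true;
        }
        return false;
    }

private:
    const DynamicRule& addressBook_;
};
```

The static part never changes after construction; only the referenced rule is cleared, refilled, and committed when the address book changes.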
Use dynamic rules for language flexibility
Suppose an application needs to support a phrase such as "send new e-mail to NAME." The phrase "send new e-mail to" is static, and known by the application at design time, well before run time. The application could use the following static XML grammar to support these phrases.
```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar xml:lang="en-US" root="email" tag-format="semantics/1.0" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         xmlns:sapi="http://schemas.microsoft.com/Speech/2002/06/SRGSExtensions">

  <rule id="email">
    <item> Send new e-mail to </item>
    <ruleref uri="#addressBook" />
    <!-- Add semantic property tag for easy information retrieval -->
    <tag> out.addressee=rules.latest(); </tag>
  </rule>

  <rule id="addressBook" sapi:dynamic="true">
    <!-- We will replace this placeholder text immediately at run time -->
    <item> placeholder </item>
  </rule>

</grammar>
```
The following is the source code to manipulate the dynamic rule, "addressBook" (error checking omitted for brevity):
```cpp
HRESULT hr = S_OK;

// Create a new grammar object.
hr = cpRecoContext->CreateGrammar(GRAM_ID, &cpRecoGrammar);
// Check hr

// Deactivate the grammar to prevent premature recognitions
// against an "under-construction" grammar.
hr = cpRecoGrammar->SetGrammarState(SPGS_DISABLED);
// Check hr

// Load the email grammar dynamically, so changes can be made at run time.
hr = cpRecoGrammar->LoadCmdFromFile(L"email.xml", SPLO_DYNAMIC);
// Check hr

SPSTATEHANDLE hRule;

// Retrieve the dynamic rule addressBook.
hr = cpRecoGrammar->GetRule(L"addressBook", NULL, SPRAF_Dynamic, FALSE, &hRule);
// Check hr

// Clear the placeholder text, and everything else in the dynamic addressBook rule.
hr = cpRecoGrammar->ClearRule(hRule);
// Check hr

// Add the real address book (for example "Oliver Lee", "Alan Brewer",
// "April Reagan", and other entries).
// Note that ISpRecoGrammar inherits from ISpGrammarBuilder,
// so the application gets the grammar compiler and ::AddWordTransition for free!
hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"Oliver Lee", NULL, SPWT_LEXICAL, 1, NULL);
// Check hr
hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"Alan Brewer", NULL, SPWT_LEXICAL, 1, NULL);
// Check hr
hr = cpRecoGrammar->AddWordTransition(hRule, NULL, L"April Reagan", NULL, SPWT_LEXICAL, 1, NULL);
// Check hr
// ... Add rest of address book

// Commit the grammar changes, which updates the loaded grammar
// and notifies the SR engine about the rule change (such as "addressBook").
hr = cpRecoGrammar->Commit(NULL);
// Check hr

// Activate the grammar since "construction" is finished,
// and the grammar is ready to receive recognitions.
hr = cpRecoGrammar->SetGrammarState(SPGS_ENABLED);
// Check hr
```
Retrieving semantic tags or properties from recognition results
Note that the static XML grammar uses a semantic property tag, out.addressee=rules.latest(). This property lets the application retrieve the dynamic phrase easily at run time. Whenever a recognition is received for the rule named "email", search the property tree (see SPPHRASE.pProperties) for the property named "addressee". Then call ISpRecoResult::GetText with that property's ulFirstElement and ulCountOfElements values (both members of SPPHRASEPROPERTY) to retrieve the exact text that the user spoke into the dynamic rule. For example, if the user says "send new e-mail to Oliver Lee", you retrieve "Oliver Lee"; if the user says "send new e-mail to April Reagan", you retrieve "April Reagan"; and so on.
```cpp
// Activate the email rule to begin receiving recognitions.
hr = cpRecoGrammar->SetRuleState(L"email", NULL, SPRS_ACTIVE);
// Check hr

PWCHAR pwszEmailName = NULL;

// Default event interest is recognition, so wait for recognition event.
// NOTE: this could be placed in a loop to process multiple recognitions.
hr = cpRecoContext->WaitForNotifyEvent(MY_REASONABLE_TIMEOUT);
// Check hr

// Event notification fired.
if (S_OK == hr)
{
    CSpEvent spEvent;
    // If event is retrieved and it is a recognition...
    if (S_OK == spEvent.GetFrom(cpRecoContext) && SPEI_RECOGNITION == spEvent.eEventId)
    {
        // Get the recognition result.
        CComPtr<ISpRecoResult> cpRecoResult = spEvent.RecoResult();
        if (cpRecoResult)
        {
            SPPHRASE* pPhrase = NULL;
            // Get the phrase object from the recognition result.
            hr = cpRecoResult->GetPhrase(&pPhrase);
            if (SUCCEEDED(hr) && pPhrase)
            {
                // If "email" rule was recognized ...
                if (0 == wcscmp(L"email", pPhrase->Rule.pszName))
                {
                    // ... ensure that a first property exists and is "addressee".
                    if (pPhrase->pProperties &&
                        0 == wcscmp(L"addressee", pPhrase->pProperties->pszName))
                    {
                        // Store the user's spoken "send-to" name in a variable
                        // for later processing. GetText is a method of the
                        // result object, not of the SPPHRASE structure.
                        // Free pwszEmailName with ::CoTaskMemFree when done.
                        hr = cpRecoResult->GetText(pPhrase->pProperties->ulFirstElement,
                                                   pPhrase->pProperties->ulCountOfElements,
                                                   FALSE, &pwszEmailName, NULL);
                        // Check hr
                    }
                }
                // Free the phrase object.
                if (pPhrase) ::CoTaskMemFree(pPhrase);
            }
        }
    }
}
```
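The sample above inspects only the first property. When a grammar attaches several properties, the whole property tree has to be searched, which can be sketched as follows. `PhraseProperty` is a simplified, hypothetical stand-in for SPPHRASEPROPERTY that keeps only the members needed for the traversal (the member names mirror the SAPI structure), and `findProperty` is an illustrative helper, not part of the Speech Platform.

```cpp
#include <cwchar>

// Simplified, hypothetical stand-in for SPPHRASEPROPERTY; only the members
// needed for the traversal are modeled.
struct PhraseProperty {
    const wchar_t* pszName;
    unsigned long ulFirstElement;
    unsigned long ulCountOfElements;
    const PhraseProperty* pNextSibling;
    const PhraseProperty* pFirstChild;
};

// Depth-first search of the property tree for the first property whose name
// matches. Returns nullptr when no such property exists.
const PhraseProperty* findProperty(const PhraseProperty* node, const wchar_t* name) {
    for (; node != nullptr; node = node->pNextSibling) {
        if (node->pszName != nullptr && std::wcscmp(node->pszName, name) == 0) {
            return node;
        }
        if (const PhraseProperty* hit = findProperty(node->pFirstChild, name)) {
            return hit;
        }
    }
    return nullptr;
}
```

With the real structures, the matching node's `ulFirstElement` and `ulCountOfElements` would be the values passed on to the text retrieval call.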
Using semantic properties, hypotheses, and "property pushing"
The Microsoft Speech Platform supports a feature called "semantic property pushing" which enables applications to detect the semantic property structure more accurately at recognition time. "Property pushing" is done by the Speech Platform at compile time, whereby the compiler moves semantic properties to the last terminal node within a rule that remains unambiguous.
For example, the phrases "a b c d" and "a b e f g" both have prefixes of "a b". The compiler will automatically split the phrases into three separate phrases, "a b", "c d", and "e f g", where the first phrase is the common prefix to both recognizable phrases.
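The prefix factoring described here can be illustrated with a small C++17 sketch. It shows only the splitting of two phrases at their common word prefix; the actual grammar compiler works on its internal state graph, and the helper names (`splitWords`, `commonPrefixLength`, `factor`) are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split a phrase into whitespace-separated words.
std::vector<std::string> splitWords(const std::string& phrase) {
    std::istringstream stream(phrase);
    std::vector<std::string> words;
    for (std::string word; stream >> word;) words.push_back(word);
    return words;
}

// Number of leading words that the two phrases share.
std::size_t commonPrefixLength(const std::vector<std::string>& a,
                               const std::vector<std::string>& b) {
    std::size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

// Join words[from, to) back into a single space-separated string.
std::string joinWords(const std::vector<std::string>& words,
                      std::size_t from, std::size_t to) {
    std::string out;
    for (std::size_t i = from; i < to; ++i) {
        if (!out.empty()) out += ' ';
        out += words[i];
    }
    return out;
}

struct FactoredPhrases {
    std::string prefix;      // Words common to both phrases.
    std::string firstTail;   // Remainder of the first phrase.
    std::string secondTail;  // Remainder of the second phrase.
};

// Split two phrases into their common prefix and their distinct tails.
FactoredPhrases factor(const std::string& first, const std::string& second) {
    const auto a = splitWords(first);
    const auto b = splitWords(second);
    const std::size_t n = commonPrefixLength(a, b);
    return {joinWords(a, 0, n), joinWords(a, n, a.size()), joinWords(b, n, b.size())};
}
```

Calling `factor("a b c d", "a b e f g")` yields the three pieces named in the text: the shared prefix and the two tails.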
The purpose of this feature is to enable applications that place properties on the phrases to detect which branch is being hypothesized as soon as the first unambiguous (non-common) portion of the phrase is spoken. When the user speaks "a b" it is not clear if the user will say "a b c d" or "a b e f g". If the user then says "e", the application can obviously eliminate the "a b c d" option. If the grammar author attached properties to the end of both phrases, the semantic property would be returned as soon as the user spoke the first unambiguous portion of the text (for example, "c" or "e"). The following SRGS grammar illustrates one way to accomplish this:
```xml
<?xml version="1.0" encoding="UTF-8" ?>
<grammar version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0" root="Main">

  <rule id="Main">
    <item> a </item>
    <item> b </item>
    <one-of>
      <item> c <tag> out="a,b,c,d"; </tag> </item>
      <item> e <tag> out="a,b,e,f,g"; </tag> </item>
    </one-of>
  </rule>

</grammar>
```
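From the application's point of view, the effect is that a property becomes known as soon as the words heard so far match exactly one phrase. A minimal C++17 sketch of that check follows; it models the idea only, and `TaggedPhrase` and `hypothesizedProperty` are hypothetical names rather than Speech Platform types.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// A recognizable phrase together with the semantic property attached to it.
struct TaggedPhrase {
    std::vector<std::string> words;
    std::string property;
};

// Returns the property of the only phrase still consistent with the words
// heard so far, or nullopt while several phrases (or none) remain possible.
std::optional<std::string> hypothesizedProperty(
        const std::vector<TaggedPhrase>& phrases,
        const std::vector<std::string>& heard) {
    const TaggedPhrase* candidate = nullptr;
    for (const auto& phrase : phrases) {
        if (heard.size() > phrase.words.size()) continue;
        if (!std::equal(heard.begin(), heard.end(), phrase.words.begin())) continue;
        if (candidate != nullptr) return std::nullopt;  // Still ambiguous.
        candidate = &phrase;
    }
    if (candidate == nullptr) return std::nullopt;  // Nothing matches.
    return candidate->property;
}
```

With the two example phrases, the words "a b" alone give no property, while "a b e" already identifies the second phrase.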
Note that the compiler reports an error ("Ambiguous Semantic Property") if multiple properties are pushed to the same node and two phrases are not unique. For example, the following grammar fails with "ambiguous semantic property" because two of its phrases are identical but carry different properties, so the compiler cannot determine which property to assign.
```xml
<rule id="AmbiguousProperty">
  <one-of>
    <item> this is a test <tag> out="42" </tag> </item>
    <item> different sentence <tag> out="3" </tag> </item>
    <item> this is a test <tag> out="75" </tag> </item>
  </one-of>
</rule>
```
The first and third phrases are the same. Note that these results are by design and are meant to prevent creating grammars that have multiple phrases with conflicting semantic properties.
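An application that builds rules at run time might want to catch this condition before handing the phrases to the compiler. The following C++17 sketch shows one way to do that, assuming the phrases and their property values are available as plain string pairs; `findAmbiguousPhrases` is a hypothetical helper, not a Speech Platform function.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Returns every phrase that repeats an earlier phrase with a different
// property value. A phrase is reported once per conflicting repeat.
std::vector<std::string> findAmbiguousPhrases(
        const std::vector<std::pair<std::string, std::string>>& taggedPhrases) {
    std::map<std::string, std::string> firstProperty;
    std::vector<std::string> ambiguous;
    for (const auto& [phrase, property] : taggedPhrases) {
        const auto [it, inserted] = firstProperty.emplace(phrase, property);
        if (!inserted && it->second != property) {
            ambiguous.push_back(phrase);
        }
    }
    return ambiguous;
}
```

For the rule above, the check reports "this is a test", because its second occurrence carries "75" while the first carries "42".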
There are a number of scenarios where property pushing can be helpful for an application.
One possibility is an application that wants to detect failures more intelligently. When a false recognition occurs, the application can detect the last semantic property returned and display an error message relevant to the attempted voice command.
Another scenario might be that of high-performance applications that wish to increase responsiveness of the user interface when a long voice command is spoken. The application can wait for the first unambiguous semantic property to be received (using hypothesis) and then fire the response action without waiting for the voice command to complete. This has the added benefit of allowing users to speak partial voice commands (for example, instead of "go to website w w w Microsoft com" the user can say the slightly shorter "go to website w w w Microsoft"). The drawback is that the application must guard against performing critical, unrecoverable actions before completing the phrase (for example, "delete hard drive" might fire after only "delete" if there are no other "delete" commands). Careful application design should enable the application to appear quicker and easier to use, without sacrificing robustness. By performing user studies, the application designer can decide which commands are capable of short circuiting and which are more critical.
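The guard against premature critical actions can be as simple as a lookup before acting on a hypothesis. The sketch below is one possible policy under stated assumptions: the set of critical commands and the function name `mayFireEarly` are hypothetical, and a real application would derive the list from its own user studies.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical policy: a command identified from a hypothesis may fire before
// the utterance completes only if it is not marked as critical.
bool mayFireEarly(const std::string& command, bool utteranceComplete) {
    // Assumed list of commands that must never be short-circuited.
    static const std::unordered_set<std::string> kCriticalCommands = {
        "delete hard drive",
    };
    if (utteranceComplete) {
        return true;  // The full phrase was recognized; nothing is premature.
    }
    return kCriticalCommands.count(command) == 0;
}
```

Non-critical commands respond as soon as they are unambiguous, while critical ones wait for the final recognition.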