Note
Please see Azure Cognitive Services for Speech documentation for the latest supported speech solutions.
Microsoft Speech Platform
Use TTS Events
The Microsoft Speech Platform includes events that provide notifications and return information about the status of text-to-speech (TTS) operations. Most TTS events are raised by the TTS engine while it is generating synthesized speech. An application can register to receive notification of the events that it wants to process. In addition to subscribing to events, applications should establish a Win32 event to signal when speech events are available.
Example
The following code excerpt subscribes to the events that are raised when the input stream ends and when a bookmark is reached, and then establishes a Win32 event that signals when speech events are available.
```
// Subscribe to the end stream event and the bookmark event.
if (SUCCEEDED(hr))
{
    ULONGLONG ullEventInterest = SPFEI(SPEI_END_INPUT_STREAM) | SPFEI(SPEI_TTS_BOOKMARK);
    hr = cpVoice->SetInterest(ullEventInterest, ullEventInterest);
}

// Establish a Win32 event to signal when speech events are available.
HANDLE hSpeechNotifyEvent = INVALID_HANDLE_VALUE;

if (SUCCEEDED(hr))
{
    hr = cpVoice->SetNotifyWin32Event();
}

if (SUCCEEDED(hr))
{
    // Retrieve the handle from the same voice object that was configured above.
    hSpeechNotifyEvent = cpVoice->GetNotifyEventHandle();

    if (INVALID_HANDLE_VALUE == hSpeechNotifyEvent)
    {
        // Notification handle unsupported.
        hr = E_NOINTERFACE;
    }
}
```
See the end of this topic for an example of a complete console application that subscribes to events, generates speech, and processes events raised during speech generation.
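The excerpt above combines two event flags with a bitwise OR before passing them to `SetInterest`. The following portable sketch illustrates only that one-bit-per-event idea. The `TtsEvent` enumeration, its numeric values, and the helper names are hypothetical stand-ins, not the SDK's definitions; the real `SPFEI` macro may set additional bits, so production code should use the macro and the `SPEI_*` constants from the SDK headers.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Hypothetical stand-ins for event identifiers. The numeric values are
// illustrative only and are not taken from the SDK headers.
enum class TtsEvent : unsigned {
    StartInputStream = 1,
    EndInputStream   = 2,
    VoiceChange      = 3,
    Bookmark         = 4,
};

// Turn one event identifier into a single-bit flag.
constexpr std::uint64_t eventFlag(TtsEvent e) {
    return std::uint64_t{1} << static_cast<unsigned>(e);
}

// OR the individual flags together to form an interest mask.
inline std::uint64_t buildInterestMask(std::initializer_list<TtsEvent> events) {
    std::uint64_t mask = 0;
    for (TtsEvent e : events) {
        // A 64-bit mask can only hold identifiers 0 through 63.
        assert(static_cast<unsigned>(e) < 64);
        mask |= eventFlag(e);
    }
    return mask;
}

// Check whether a mask includes a given event.
constexpr bool isInterested(std::uint64_t mask, TtsEvent e) {
    return (mask & eventFlag(e)) != 0;
}
```

With these stand-ins, `buildInterestMask({TtsEvent::EndInputStream, TtsEvent::Bookmark})` yields a mask for which `isInterested` is true for the two listed events and false for the others.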
TTS events in the Speech Platform
The following table lists and describes the events raised in the Speech Platform to support speech synthesis (TTS).
Event | Description |
---|---|
SPEI_START_INPUT_STREAM | The input stream (text or audio) from a Speak or SpeakStream call has begun synthesizing to the output. The event is fired by the Speech Platform. |
SPEI_END_INPUT_STREAM | The input stream (text or audio) from a Speak or SpeakStream call has finished synthesizing to the output. The event is fired by the Speech Platform. |
SPEI_VOICE_CHANGE | The Speech Platform fires this event for voice changes within a single input stream of a Speak call. wParam is either zero or SPF_PERSIST_XML: if the current Speak call specifies SPF_PERSIST_XML, wParam is SPF_PERSIST_XML; otherwise, it is zero. lParam is the current voice object token. elParamType must be SPET_LPARAM_IS_TOKEN. |
SPEI_TTS_BOOKMARK | The bookmark element is used to insert a bookmark into the output stream. If an application specifies interest in bookmark events, it will receive the bookmark events during synthesis. wParam is the current bookmark name (in base 10) converted to a long integer. If the name of the current bookmark is not an integer, wParam will be zero. lParam is the bookmark string. elParamType must be SPET_LPARAM_IS_STRING. |
SPEI_WORD_BOUNDARY | A word is beginning to synthesize. Markup language (XML) markers are counted in the boundaries and offsets. wParam is the character length of the word in the current input stream being synthesized. lParam is the character position within the current text input stream of the word being synthesized. |
SPEI_PHONEME | A phoneme was returned by the TTS engine. The high word of wParam is the duration, in milliseconds, of the current phoneme element. The low word of wParam is the ID of the next phoneme element. The high word of lParam is the phoneme element feature defined in SPVFEATURE. This value will be zero if the current phoneme element is not a primary stress or emphasis. The low word of lParam is the ID of the current phoneme element being synthesized. When the engine synthesizes a phoneme composed of more than one phoneme element, it raises an event for each element. For example, when a Japanese TTS engine speaks the phoneme "KYA," which is composed of the phoneme elements "KI" and "XYA," it raises an SPEI_PHONEME event for each element. Because the element "KI" in this case modifies the sound of the element following it, rather than initiating a sound, the duration of its SPEI_PHONEME event is zero. |
SPEI_SENTENCE_BOUNDARY | A sentence is beginning to synthesize. wParam is the character length of the sentence including punctuation in the current input stream being synthesized. lParam is the character position within the current text input stream of the sentence being synthesized. |
SPEI_VISEME | A viseme was determined by the synthesis engine. The high word of wParam is the duration, in milliseconds, of the current viseme. The low word of wParam is the next viseme, of type SPVISEMES. The high word of lParam is the viseme feature defined in SPVFEATURE. This value will be zero if the current viseme is not a primary stress or emphasis. The low word of lParam is the current viseme being synthesized. |
SPEI_TTS_AUDIO_LEVEL | This event is fired by the Speech Platform. lParam is 0, and wParam is the current audio level from zero to 100. |
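The SPEI_PHONEME and SPEI_VISEME rows in the table above pack two 16-bit values into each of wParam and lParam. The following portable sketch shows one way to unpack them. It uses plain 64-bit integers as stand-ins for the Windows WPARAM and LPARAM types, and the `SpeechUnitInfo` struct and `decodeSpeechUnit` function are hypothetical names introduced only for illustration; on Windows, the `HIWORD` and `LOWORD` macros serve the same purpose.

```cpp
#include <cstdint>

// Hypothetical container for the four fields described in the table.
struct SpeechUnitInfo {
    std::uint16_t durationMs;  // High word of wParam: duration in milliseconds.
    std::uint16_t nextId;      // Low word of wParam: next phoneme element or viseme.
    std::uint16_t feature;     // High word of lParam: SPVFEATURE value, or zero.
    std::uint16_t currentId;   // Low word of lParam: current phoneme element or viseme.
};

// Extract bits 16 through 31 of a value.
constexpr std::uint16_t highWord(std::uint64_t v) {
    return static_cast<std::uint16_t>((v >> 16) & 0xFFFFu);
}

// Extract bits 0 through 15 of a value.
constexpr std::uint16_t lowWord(std::uint64_t v) {
    return static_cast<std::uint16_t>(v & 0xFFFFu);
}

// Split the two event parameters into their four 16-bit fields.
inline SpeechUnitInfo decodeSpeechUnit(std::uint64_t wParam, std::uint64_t lParam) {
    return SpeechUnitInfo{highWord(wParam), lowWord(wParam),
                          highWord(lParam), lowWord(lParam)};
}
```

A decoded event whose `durationMs` is zero corresponds to a case like the "KI" element in the "KYA" example: the element modifies the following sound rather than producing audio of its own.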
Complete TTS example
The following is an example of a complete console application that subscribes to events, generates speech, and processes events raised during speech generation.
```
// Header names assume that the Speech Platform SDK and ATL are on the include path.
#include <atlbase.h>
#include <sapi.h>
#include <sphelper.h>
#include <stdio.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
    CoInitialize(NULL);
    {
        HRESULT hr = S_OK;

        // Find the best token to use for a voice that speaks US English, preferably female.
        CComPtr<ISpObjectToken> cpVoiceToken;
        if (SUCCEEDED(hr))
        {
            hr = SpFindBestToken(SPCAT_VOICES, L"language=409", L"gender=female", &cpVoiceToken);
        }

        // Create a voice and set its token to the one we just found.
        CComPtr<ISpVoice> cpVoice;
        if (SUCCEEDED(hr))
        {
            hr = cpVoice.CoCreateInstance(CLSID_SpVoice);
        }
        if (SUCCEEDED(hr))
        {
            hr = cpVoice->SetVoice(cpVoiceToken);
        }

        // Register interest in the END_INPUT_STREAM and TTS_BOOKMARK events.
        if (SUCCEEDED(hr))
        {
            ULONGLONG ullEventInterest = SPFEI(SPEI_END_INPUT_STREAM) | SPFEI(SPEI_TTS_BOOKMARK);
            hr = cpVoice->SetInterest(ullEventInterest, ullEventInterest);
        }

        // Establish a Win32 event to signal when speech events are available,
        // as in the excerpt earlier in this topic.
        if (SUCCEEDED(hr))
        {
            hr = cpVoice->SetNotifyWin32Event();
        }

        // Get the notification event handle so we can wait until we are signaled with events.
        HANDLE hSpeechNotifyEvent = INVALID_HANDLE_VALUE;
        if (SUCCEEDED(hr))
        {
            hSpeechNotifyEvent = cpVoice->GetNotifyEventHandle();
            if (INVALID_HANDLE_VALUE == hSpeechNotifyEvent)
            {
                // Notification handle unsupported by engine.
                hr = E_NOINTERFACE;
            }
        }

        // Establish a separate Win32 event to signal event loop exit.
        HANDLE hExitEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
        if (NULL == hExitEvent && SUCCEEDED(hr))
        {
            hr = HRESULT_FROM_WIN32(GetLastError());
        }

        // Collect the events listened for to pump the speech event loop.
        HANDLE rghEvents[] = { hSpeechNotifyEvent, hExitEvent };

        // Speak an SSML prompt from a file, asynchronously, so we can wait for events.
        if (SUCCEEDED(hr))
        {
            hr = cpVoice->Speak(L"C:\\Test.ssxml", SPF_IS_FILENAME | SPF_IS_XML | SPF_PARSE_SSML | SPF_ASYNC, 0);
        }

        // Speech synthesis event loop; continue until the exit event is signaled.
        BOOL fContinue = TRUE;
        while (fContinue && SUCCEEDED(hr))
        {
            // Wait for either a speech event or an exit event.
            DWORD dwMessage = WaitForMultipleObjects(sp_countof(rghEvents), rghEvents, FALSE, INFINITE);

            switch (dwMessage)
            {
                // WAIT_OBJECT_0 is a speech event from hSpeechNotifyEvent.
                case WAIT_OBJECT_0:
                {
                    // Sequentially grab all available speech events from the speech event queue.
                    CSpEvent spevent;
                    while (S_OK == spevent.GetFrom(cpVoice))
                    {
                        switch (spevent.eEventId)
                        {
                            case SPEI_END_INPUT_STREAM:
                                // The TTS stream has completed; we should exit the loop.
                                SetEvent(hExitEvent);
                                break;

                            case SPEI_TTS_BOOKMARK:
                            {
                                // We just received a bookmark event; print the name of the bookmark reached.
                                LPCWSTR pszBookmarkName = spevent.BookmarkName();
                                wprintf(L"Bookmark reached: %s\r\n", pszBookmarkName);
                                break;
                            }
                        }
                    }
                    break;
                }

                // WAIT_OBJECT_0 + 1 is the exit event; discontinue the speech loop.
                case WAIT_OBJECT_0 + 1:
                    fContinue = FALSE;
                    break;

                // The wait failed; record the error so that the loop ends.
                default:
                    hr = HRESULT_FROM_WIN32(GetLastError());
                    break;
            }
        }

        // Release the exit event. The notification handle is owned by the voice and is not closed here.
        if (NULL != hExitEvent)
        {
            CloseHandle(hExitEvent);
        }
    }
    CoUninitialize();

    return 0;
}
```
The contents of the SSML document (Test.ssxml) referenced in the code example are as follows:
```
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <s> Now for today's weather. </s>
</speak>
```
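When a prompt does contain bookmarks, the SPEI_TTS_BOOKMARK description earlier in this topic says that wParam carries the bookmark name converted to a long integer in base 10, or zero when the name is not an integer. The following portable sketch mimics that rule for an application that wants to derive the same number from the bookmark string. The function name `bookmarkNameToId` is hypothetical, and treating partially numeric names such as "12abc" as non-integers is an assumption; the platform's handling of such names is not specified here.

```cpp
#include <cerrno>
#include <cwchar>

// Convert a bookmark name to a base-10 long integer.
// Returns 0 when the name is null, empty, out of range, or not entirely an integer.
// Assumption: a name with trailing non-digit characters counts as "not an integer".
inline long bookmarkNameToId(const wchar_t* name) {
    if (name == nullptr || *name == L'\0') {
        return 0;
    }

    wchar_t* end = nullptr;
    errno = 0;
    const long value = std::wcstol(name, &end, 10);

    // Reject overflow, names with no digits, and names with trailing characters.
    if (errno == ERANGE || end == name || *end != L'\0') {
        return 0;
    }
    return value;
}
```

Because a bookmark literally named "0" and a non-numeric bookmark both map to zero under this rule, an application that needs to tell them apart should compare the bookmark string itself.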