Lower speech synthesis latency using Speech SDK

In this article, we introduce the best practices to lower the text to speech synthesis latency and bring the best performance to your end users.

Normally, we measure the latency by first byte latency and finish latency, as follows:

| Latency | Description | SpeechSynthesisResult property key |
|---|---|---|
| first byte latency | Indicates the time delay between the start of the synthesis task and receipt of the first chunk of audio data. | SpeechServiceResponse_SynthesisFirstByteLatencyMs |
| finish latency | Indicates the time delay between the start of the synthesis task and the receipt of the whole synthesized audio data. | SpeechServiceResponse_SynthesisFinishLatencyMs |

The Speech SDK puts the latency durations in the Properties collection of SpeechSynthesisResult. The following C# sample code shows these values.

var result = await synthesizer.SpeakTextAsync(text);
Console.WriteLine($"first byte latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs)} ms");
Console.WriteLine($"finish latency: \t{result.Properties.GetProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs)} ms");
// you can also get the result id, and send to us when you need help for diagnosis
var resultId = result.ResultId;

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. The following C++ code reads them.

auto result = synthesizer->SpeakTextAsync(text).get();
auto firstByteLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFirstByteLatencyMs));
auto finishedLatency = std::stoi(result->Properties.GetProperty(PropertyId::SpeechServiceResponse_SynthesisFinishLatencyMs));
// you can also get the result id, and send to us when you need help for diagnosis
auto resultId = result->ResultId;

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. The following Java code reads them.

SpeechSynthesisResult result = synthesizer.SpeakTextAsync(text).get();
System.out.println("first byte latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs) + " ms.");
System.out.println("finish latency: \t" + result.getProperties().getProperty(PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs) + " ms.");
// you can also get the result id, and send to us when you need help for diagnosis
String resultId = result.getResultId();

The Speech SDK measures the latencies and puts them in the property bag of SpeechSynthesisResult. The following Python code reads them.

result = synthesizer.speak_text_async(text).get()
first_byte_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs))
finished_latency = int(result.properties.get_property(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs))
# you can also get the result id, and send to us when you need help for diagnosis
result_id = result.result_id

| Latency | Description | SPXSpeechSynthesisResult property key |
|---|---|---|
| first byte latency | Indicates the time delay between the start of synthesis and receipt of the first audio chunk. | SPXSpeechServiceResponseSynthesisFirstByteLatencyMs |
| finish latency | Indicates the time delay between the start of synthesis and receipt of the whole synthesized audio. | SPXSpeechServiceResponseSynthesisFinishLatencyMs |

The Speech SDK measures the latencies and puts them in the property bag of SPXSpeechSynthesisResult. The following Objective-C code reads them.

SPXSpeechSynthesisResult *speechResult = [speechSynthesizer speakText:text];
int firstByteLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFirstByteLatencyMs] intValue];
int finishedLatency = [[speechResult.properties getPropertyById:SPXSpeechServiceResponseSynthesisFinishLatencyMs] intValue];
// you can also get the result ID and send it to us when you need help with diagnosis
NSString *resultId = speechResult.resultId;

The first byte latency is lower than the finish latency in most cases. The first byte latency is independent of text length, while the finish latency increases with text length.

Ideally, we want to minimize the user-experienced latency (the latency before the user hears the sound) to one network round trip time plus the first audio chunk latency of the speech synthesis service.

Streaming

Streaming is critical to lowering latency. Client code can start playback when the first audio chunk is received. In a service scenario, you can forward the audio chunks immediately to your clients instead of waiting for the whole audio.
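
The forwarding pattern described above can be sketched without the Speech SDK. The following is a minimal illustration, not SDK code: relay_chunks, send, and clock are hypothetical names, and the chunks argument stands in for whatever source delivers audio chunks (for example, a loop over an AudioDataStream). It forwards each chunk as soon as it arrives and records how long the first chunk and the whole audio took.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def relay_chunks(
    chunks: Iterable[bytes],
    send: Callable[[bytes], None],
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Optional[float]]:
    """Forward each audio chunk as soon as it arrives and record timings.

    `chunks` is any iterable that yields audio data incrementally.
    `send` is a hypothetical callback that plays or forwards one chunk.
    """
    start = clock()
    first_chunk_seconds: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if first_chunk_seconds is None:
            # Playback or forwarding can begin here, before synthesis finishes.
            first_chunk_seconds = clock() - start
        send(chunk)
        total_bytes += len(chunk)
    return {
        "first_chunk_seconds": first_chunk_seconds,  # None if no chunk arrived
        "finish_seconds": clock() - start,
        "total_bytes": total_bytes,
    }
```

Because send is called inside the loop, the listener waits only for the first chunk rather than for the whole audio, which is the difference between first byte latency and finish latency.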

You can use the PullAudioOutputStream, PushAudioOutputStream, Synthesizing event, and AudioDataStream of the Speech SDK to enable streaming.

Taking AudioDataStream as an example in C#:

using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
{
    using (var result = await synthesizer.StartSpeakingTextAsync(text))
    {
        using (var audioDataStream = AudioDataStream.FromResult(result))
        {
            byte[] buffer = new byte[16000];
            uint filledSize = 0;
            while ((filledSize = audioDataStream.ReadData(buffer)) > 0)
            {
                Console.WriteLine($"{filledSize} bytes received.");
            }
        }
    }
}

Taking AudioDataStream as an example in C++:

auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
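// StartSpeakingTextAsync returns as soon as synthesis starts, so audio can be read while it is still being generated
auto result = synthesizer->StartSpeakingTextAsync(text).get();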
auto audioDataStream = AudioDataStream::FromResult(result);
uint8_t buffer[16000];
uint32_t filledSize = 0;
while ((filledSize = audioDataStream->ReadData(buffer, sizeof(buffer))) > 0)
{
    cout << filledSize << " bytes received." << endl;
}

Taking AudioDataStream as an example in Java:

SpeechSynthesizer synthesizer = new SpeechSynthesizer(config, null);
SpeechSynthesisResult result = synthesizer.StartSpeakingTextAsync(text).get();
AudioDataStream audioDataStream = AudioDataStream.fromResult(result);
byte[] buffer = new byte[16000];
long filledSize = audioDataStream.readData(buffer);
while (filledSize > 0) {
    System.out.println(filledSize + " bytes received.");
    filledSize = audioDataStream.readData(buffer);
}

Taking AudioDataStream as an example in Python:

speech_synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config, audio_config=None)
result = speech_synthesizer.start_speaking_text_async(text).get()
audio_data_stream = speechsdk.AudioDataStream(result)
audio_buffer = bytes(16000)
filled_size = audio_data_stream.read_data(audio_buffer)
while filled_size > 0:
    print("{} bytes received.".format(filled_size))
    filled_size = audio_data_stream.read_data(audio_buffer)

You can use the SPXPullAudioOutputStream, SPXPushAudioOutputStream, the Synthesizing event, and SPXAudioDataStream of the Speech SDK to enable streaming.

Taking SPXAudioDataStream as an example in Objective-C:

SPXSpeechSynthesizer *synthesizer = [[SPXSpeechSynthesizer alloc] initWithSpeechConfiguration:speechConfig audioConfiguration:nil];
SPXSpeechSynthesisResult *speechResult = [synthesizer startSpeakingText:inputText];
SPXAudioDataStream *stream = [[SPXAudioDataStream alloc] initFromSynthesisResult:speechResult];
NSMutableData* data = [[NSMutableData alloc]initWithCapacity:16000];
while ([stream readData:data length:16000] > 0) {
    // Read data here
}

Pre-connect and reuse SpeechSynthesizer

The Speech SDK uses a websocket to communicate with the service. Ideally, the network latency should be one round trip time (RTT). If the connection is newly established, the network latency also includes the time needed to establish the connection. Establishing a websocket connection requires a TCP handshake, an SSL handshake, an HTTP connection, and a protocol upgrade, each of which adds delay. To avoid this connection latency, we recommend pre-connecting and reusing the SpeechSynthesizer.

Pre-connect

To pre-connect, establish a connection to the Speech service when you know the connection will be needed soon. For example, if you're building a speech bot on the client, you can pre-connect to the speech synthesis service when the user starts to talk, and call SpeakTextAsync when the bot's reply text is ready.

using (var synthesizer = new SpeechSynthesizer(uspConfig, null as AudioConfig))
{
    using (var connection = Connection.FromSpeechSynthesizer(synthesizer))
    {
        connection.Open(true);
    }
    await synthesizer.SpeakTextAsync(text);
}

// C++
auto synthesizer = SpeechSynthesizer::FromConfig(config, nullptr);
auto connection = Connection::FromSpeechSynthesizer(synthesizer);
connection->Open(true);

// Java
SpeechSynthesizer synthesizer = new SpeechSynthesizer(speechConfig, (AudioConfig) null);
Connection connection = Connection.fromSpeechSynthesizer(synthesizer);
connection.openConnection(true);

# Python
synthesizer = speechsdk.SpeechSynthesizer(config, None)
connection = speechsdk.Connection.from_speech_synthesizer(synthesizer)
connection.open(True)

// Objective-C
SPXSpeechSynthesizer* synthesizer = [[SPXSpeechSynthesizer alloc]initWithSpeechConfiguration:self.speechConfig audioConfiguration:nil];
SPXConnection* connection = [[SPXConnection alloc]initFromSpeechSynthesizer:synthesizer];
[connection open:true];

Note

If the text is available, just call SpeakTextAsync to synthesize the audio. The SDK will handle the connection.

Reuse SpeechSynthesizer

Another way to reduce the connection latency is to reuse the SpeechSynthesizer so that you don't need to create a new SpeechSynthesizer for each synthesis. We recommend using an object pool in service scenarios. See our sample code for C# and Java.
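
For readers working in Python, the object pool idea might look roughly like the sketch below. It is not the referenced C# or Java sample: SynthesizerPool, borrow, and factory are hypothetical names, and the factory is assumed to be a callable that returns a ready-to-use (ideally pre-connected) SpeechSynthesizer. The pool creates a fixed number of objects up front and hands them out one at a time.

```python
import queue
from contextlib import contextmanager
from typing import Callable, Generic, Iterator, Optional, TypeVar

T = TypeVar("T")


class SynthesizerPool(Generic[T]):
    """Fixed-size pool that reuses synthesizer objects across requests."""

    def __init__(self, factory: Callable[[], T], size: int) -> None:
        if size <= 0:
            raise ValueError("size must be positive")
        # queue.Queue is thread-safe, so worker threads can share one pool.
        self._idle: "queue.Queue[T]" = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self, timeout: Optional[float] = None) -> Iterator[T]:
        """Take an idle object and return it to the pool when the block exits.

        Raises queue.Empty if no object becomes idle within `timeout` seconds.
        """
        item = self._idle.get(timeout=timeout)
        try:
            yield item
        finally:
            # Simplification: the object goes back even if the caller failed.
            self._idle.put(item)
```

A caller would write something like `with pool.borrow() as synthesizer:` around each synthesis request. A production pool would probably also replace a synthesizer whose connection has failed, which this sketch leaves out.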

Transmit compressed audio over the network

When the network is unstable or has limited bandwidth, the payload size also affects latency. A compressed audio format also saves the users' network bandwidth, which is especially valuable for mobile users.

We support many compressed formats, including opus, webm, mp3, and silk; see the full list in SpeechSynthesisOutputFormat. For example, the bitrate of the Riff24Khz16BitMonoPcm format is 384 kbps, while Audio24Khz48KBitRateMonoMp3 uses only 48 kbps. The Speech SDK automatically uses a compressed format for transmission when a pcm output format is set. For Linux and Windows, GStreamer is required to enable this feature. Refer to this instruction to install and configure GStreamer for the Speech SDK. For Android, iOS, and macOS, no extra configuration is needed starting with version 1.20.
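
The bitrate figures above follow from simple arithmetic, which the short sketch below works through. The helper names pcm_bitrate_kbps and payload_bytes are made up for illustration, and the calculation ignores the RIFF header and any container or network framing overhead.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Bitrate of uncompressed PCM audio in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def payload_bytes(bitrate_kbps: float, seconds: float) -> int:
    """Approximate audio payload size in bytes for a given duration."""
    return int(bitrate_kbps * 1000 * seconds / 8)


# 24 kHz, 16-bit, mono PCM: 24000 * 16 * 1 = 384000 bits per second
pcm_kbps = pcm_bitrate_kbps(24000, 16, 1)
mp3_kbps = 48  # Audio24Khz48KBitRateMonoMp3, as stated in the text

# Payload for 10 seconds of audio in each format
pcm_size = payload_bytes(pcm_kbps, 10)
mp3_size = payload_bytes(mp3_kbps, 10)
print(pcm_kbps, pcm_size, mp3_size)
```

For 10 seconds of audio, this gives 480,000 bytes for the PCM format against 60,000 bytes for the MP3 format, an eightfold reduction in the data sent over the network.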

Input text streaming

Text streaming allows real-time text processing for rapid audio generation. It's perfect for dynamic text vocalization, such as reading outputs from AI models like GPT in real-time. This feature minimizes latency and improves the fluidity and responsiveness of audio outputs, making it ideal for interactive applications, live events, and responsive AI-driven dialogues.

How to use text streaming

Text streaming is supported in C#, C++ and Python with Speech SDK.

To use the text streaming feature, connect to the websocket V2 endpoint: wss://{region}.tts.speech.microsoft.com/cognitiveservices/websocket/v2

See the sample code for setting the endpoint:

// IMPORTANT: MUST use the websocket v2 endpoint
var ttsEndpoint = $"wss://{Environment.GetEnvironmentVariable("AZURE_TTS_REGION")}.tts.speech.microsoft.com/cognitiveservices/websocket/v2";
var speechConfig = SpeechConfig.FromEndpoint(
    new Uri(ttsEndpoint),
    Environment.GetEnvironmentVariable("AZURE_TTS_API_KEY"));

Key steps

  1. Create a text stream request: Use SpeechSynthesisRequestInputType.TextStream to initiate a text stream.

  2. Set global properties: Adjust settings such as output format and voice name directly, as the feature handles partial text inputs and doesn't support SSML. Refer to the following sample code for instructions on how to set them. OpenAI text to speech voices aren't supported by the text streaming feature. See this language table for full language support.

    // Set output format
    speechConfig.SetSpeechSynthesisOutputFormat(SpeechSynthesisOutputFormat.Raw24Khz16BitMonoPcm);
    
    // Set a voice name
    speechConfig.SetProperty(PropertyId.SpeechServiceConnection_SynthVoice, "en-US-AvaMultilingualNeural");
    
  3. Stream your text: For each text chunk generated from a GPT model, use request.InputStream.Write(text); to send the text to the stream.

  4. Close the stream: Once the GPT model completes its output, close the stream using request.InputStream.Close();.

For detailed implementation, see the sample code on GitHub.

In Python, set the endpoint as follows:

# IMPORTANT: MUST use the websocket v2 endpoint
speech_config = speechsdk.SpeechConfig(endpoint=f"wss://{os.getenv('AZURE_TTS_REGION')}.tts.speech.microsoft.com/cognitiveservices/websocket/v2",
                                       subscription=os.getenv("AZURE_TTS_API_KEY"))

Key steps

  1. Create a text stream request: Use speechsdk.SpeechSynthesisRequestInputType.TextStream to initiate a text stream.

  2. Set global properties: Adjust settings such as output format and voice name directly, as the feature handles partial text inputs and doesn't support SSML. Refer to the following sample code for instructions on how to set them. OpenAI text to speech voices aren't supported by the text streaming feature. See this language table for full language support.

    # set a voice name
    speech_config.speech_synthesis_voice_name = "en-US-AvaMultilingualNeural"
    
  3. Stream your text: For each text chunk generated from a GPT model, use request.input_stream.write(text) to send the text to the stream.

  4. Close the stream: Once the GPT model completes its output, close the stream using request.input_stream.close().

For detailed implementation, see the sample code on GitHub.
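
Steps 3 and 4 can be wrapped in a small helper, sketched below under some assumptions. The function name stream_text is hypothetical. The request argument is assumed to be a text stream request created as in step 1, and only the input_stream.write and input_stream.close calls named in the steps above are used. Skipping empty chunks is an extra precaution added here, not something the steps require. Creating the request and starting synthesis are left to the caller, as shown in the GitHub sample.

```python
from typing import Iterable


def stream_text(request, text_chunks: Iterable[str]) -> int:
    """Write text chunks to a text stream request, then close its input stream.

    `request` is assumed to expose `input_stream.write(text)` and
    `input_stream.close()`, as described in the key steps.
    Returns the number of chunks written.
    """
    written = 0
    try:
        for chunk in text_chunks:
            if not chunk:
                # Assumption: empty pieces from the text source carry no content.
                continue
            request.input_stream.write(chunk)
            written += 1
    finally:
        # Always close the stream so the service knows the input is complete,
        # even if the text source raised an error partway through.
        request.input_stream.close()
    return written
```

The text_chunks argument can be a generator that yields text as the model produces it, so each piece is sent to the service without waiting for the full reply.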

The C++ sample code isn't available now. For the sample code that shows how to use text streaming, see:

Other tips

Cache CRL files

The Speech SDK uses certificate revocation list (CRL) files to check certificates. Caching the CRL files until they expire avoids downloading them every time. See How to configure OpenSSL for Linux for details.

Use the latest Speech SDK

We keep improving the Speech SDK's performance, so try to use the latest Speech SDK in your application.

Load test guidelines

You can use load testing to measure the capacity and latency of the speech synthesis service. Here are some guidelines:

  • The speech synthesis service can autoscale, but scaling out takes time. If concurrency increases in a short time, the client might see long latency or a 429 error code (too many requests). We therefore recommend increasing concurrency step by step in a load test. See this article for more details, especially this example of workload patterns.
  • You can use our object pool sample (C# and Java) for load testing and collecting latency numbers. Modify the test turns and concurrency in the sample to match your target concurrency.
  • The service has a quota limitation based on real traffic. Therefore, if you want to run a load test with concurrency higher than your real traffic, connect before your test.
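
One way to plan the step-by-step increase in concurrency is to compute the ramp in advance. The sketch below is only an illustration: ramp_schedule and its parameters are hypothetical, and the step count and step duration are values you would choose for your own test, not service recommendations.

```python
from typing import List, Tuple


def ramp_schedule(target: int, steps: int, step_seconds: int) -> List[Tuple[int, int]]:
    """Return (start offset in seconds, concurrency) pairs that rise evenly to `target`."""
    if target <= 0 or steps <= 0 or step_seconds <= 0:
        raise ValueError("target, steps, and step_seconds must be positive")
    schedule = []
    for i in range(1, steps + 1):
        # Integer ceiling of target * i / steps, so the last step hits the target exactly.
        concurrency = (target * i + steps - 1) // steps
        schedule.append(((i - 1) * step_seconds, concurrency))
    return schedule


# Example: reach 20 concurrent requests in 4 steps of 30 seconds each.
for offset, concurrency in ramp_schedule(target=20, steps=4, step_seconds=30):
    print(f"t+{offset}s: run {concurrency} concurrent requests")
```

With these example values, the test runs 5, 10, 15, and then 20 concurrent requests, which gives the service time to scale out between steps.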

Next steps