Add real-time transcription into your application
Functionality described in this article is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
This guide shows the different ways you can use the real-time transcription offered by Azure Communication Services through the Call Automation SDKs.
Prerequisites
- Azure account with an active subscription. For details, see Create an account for free.
- Azure Communication Services resource. See Create an Azure Communication Services resource.
- Create and connect Azure AI services to your Azure Communication Services resource.
- Create a custom subdomain for your Azure AI services resource.
- Create a new web service application using the Call Automation SDK.
Set up a WebSocket server
Azure Communication Services requires your server application to set up a WebSocket server to stream transcription in real time. WebSocket is a standardized protocol that provides a full-duplex communication channel over a single TCP connection. Optionally, you can use Azure Web Apps to host an application that receives transcripts over a WebSocket connection. Follow this quickstart.
Establish a call
This guide assumes that you're already familiar with starting calls. If you need to learn more about starting and establishing calls, you can follow our quickstart. The steps here cover starting transcription for both incoming and outbound calls.
When working with real-time transcription, you have two options for when and how to start it:
Option 1 - Start at the time of answering or creating a call
Option 2 - Start during an ongoing call
This tutorial demonstrates option 2, starting transcription during an ongoing call. By default, 'startTranscription' is set to false when answering or creating a call.
Create a call and provide the transcription details (C#)
Define the TranscriptionOptions so that Azure Communication Services knows whether to start transcription immediately or later, which locale to transcribe in, and which WebSocket connection to use for sending the transcript.
var createCallOptions = new CreateCallOptions(callInvite, callbackUri)
{
    CallIntelligenceOptions = new CallIntelligenceOptions() { CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) },
    TranscriptionOptions = new TranscriptionOptions(new Uri(""), "en-US", false, TranscriptionTransport.Websocket)
};
CreateCallResult createCallResult = await callAutomationClient.CreateCallAsync(createCallOptions);
Start Transcription
Once you're ready to start the transcription you can make an explicit call to Call Automation to start transcribing the call.
// Start transcription with options
StartTranscriptionOptions options = new StartTranscriptionOptions()
{
    OperationContext = "startMediaStreamingContext",
    //Locale = "en-US",
};
await callMedia.StartTranscriptionAsync(options);
// Alternative: Start transcription without options
// await callMedia.StartTranscriptionAsync();
Receiving Transcription Stream
When transcription starts, your WebSocket receives the transcription metadata payload as the first packet. This payload carries the call metadata and the locale for the configuration.
{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "835be116-f750-48a4-a5a4-ab85e070e5b0",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}
Receiving Transcription data
After the metadata, the next packets your WebSocket receives are TranscriptionData for the transcribed audio.
{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}
Handling transcription stream in the web socket server
using WebServerApi;
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();
app.UseWebSockets(); // enable the WebSocket middleware before mapping the endpoint
app.Map("/ws", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await HandleWebSocket.Echo(webSocket);
    }
    else
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
});
app.Run();
Updates to your code for the websocket handler
using Azure.Communication.CallAutomation;
using System.Net.WebSockets;
using System.Text;
namespace WebServerApi
{
    public class HandleWebSocket
    {
        public static async Task Echo(WebSocket webSocket)
        {
            var buffer = new byte[1024 * 4];
            var receiveResult = await webSocket.ReceiveAsync(
                new ArraySegment<byte>(buffer), CancellationToken.None);
            while (!receiveResult.CloseStatus.HasValue)
            {
                string msg = Encoding.UTF8.GetString(buffer, 0, receiveResult.Count);
                var response = StreamingDataParser.Parse(msg);
                if (response != null)
                {
                    if (response is AudioMetadata audioMetadata)
                    {
                        Console.WriteLine("MEDIA SUBSCRIPTION ID-->"+audioMetadata.MediaSubscriptionId);
                        Console.WriteLine("SAMPLE RATE-->"+audioMetadata.SampleRate);
                    }
                    if (response is AudioData audioData)
                    {
                        Console.WriteLine("IS SILENT-->"+audioData.IsSilent);
                    }
                    if (response is TranscriptionMetadata transcriptionMetadata)
                    {
                        Console.WriteLine("TRANSCRIPTION SUBSCRIPTION ID-->"+transcriptionMetadata.TranscriptionSubscriptionId);
                        Console.WriteLine("CALL CONNECTION ID-->"+transcriptionMetadata.CallConnectionId);
                        Console.WriteLine("CORRELATION ID-->"+transcriptionMetadata.CorrelationId);
                    }
                    if (response is TranscriptionData transcriptionData)
                    {
                        foreach (var word in transcriptionData.Words)
                        {
                            Console.WriteLine("TEXT-->"+word.Text);
                        }
                    }
                }
                await webSocket.SendAsync(
                    new ArraySegment<byte>(buffer, 0, receiveResult.Count),
                    receiveResult.MessageType,
                    receiveResult.EndOfMessage,
                    CancellationToken.None);
                receiveResult = await webSocket.ReceiveAsync(
                    new ArraySegment<byte>(buffer), CancellationToken.None);
            }
            await webSocket.CloseAsync(
                receiveResult.CloseStatus.Value,
                receiveResult.CloseStatusDescription,
                CancellationToken.None);
        }
    }
}
Update Transcription
For situations where your application allows users to select their preferred language you may also want to capture the transcription in that language. To do this, Call Automation SDK allows you to update the transcription locale.
await callMedia.UpdateTranscriptionAsync("en-GB");
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
StopTranscriptionOptions stopOptions = new StopTranscriptionOptions()
{
    OperationContext = "stopTranscription"
};
await callMedia.StopTranscriptionAsync(stopOptions);
Create a call and provide the transcription details (Java)
Define the TranscriptionOptions so that Azure Communication Services knows whether to start transcription immediately or later, which locale to transcribe in, and which WebSocket connection to use for sending the transcript.
CallInvite callInvite = new CallInvite(target, caller);
CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions()
TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, appConfig.getCallBackUri());
Response<CreateCallResult> result = client.createCallWithResponse(createCallOptions, Context.NONE);
return result.getValue().getCallConnectionProperties().getCallConnectionId();
Start Transcription
Once you're ready to start the transcription you can make an explicit call to Call Automation to start transcribing the call.
// Option 1: Start transcription with options
StartTranscriptionOptions transcriptionOptions = new StartTranscriptionOptions();
client.getCallConnection(callConnectionId)
    .getCallMedia()
    .startTranscriptionWithResponse(transcriptionOptions, Context.NONE);
// Alternative: Start transcription without options
// client.getCallConnection(callConnectionId)
// .getCallMedia()
// .startTranscription();
Receiving Transcription Stream
When transcription starts, your WebSocket receives the transcription metadata payload as the first packet. This payload carries the call metadata and the locale for the configuration.
{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "835be116-f750-48a4-a5a4-ab85e070e5b0",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}
Receiving Transcription data
After the metadata, the next packets your WebSocket receives are TranscriptionData for the transcribed audio.
{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}
Handling transcription stream in the web socket server
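package com.example;
import org.glassfish.tyrus.server.Server;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class App {
    public static void main(String[] args) {
        Server server = new Server("localhost", 8081, "/ws", null, WebSocketServer.class);
        try {
            server.start();
            System.out.println("Web socket running on port 8081...");
            // Keep the server alive until a line is entered on the console.
            BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
            reader.readLine();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            server.stop();
        }
    }
}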
Updates to your code for the websocket handler
package com.example;
import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;
// The endpoint path "/server" is an assumption; use the path your client connects to.
@ServerEndpoint("/server")
public class WebSocketServer {
    @OnMessage
    public void onMessage(String message, Session session) {
        StreamingData data = StreamingDataParser.parse(message);
        if (data instanceof AudioMetadata) {
            AudioMetadata audioMetaData = (AudioMetadata) data;
            System.out.println("SUBSCRIPTION ID: --> " + audioMetaData.getMediaSubscriptionId());
            System.out.println("ENCODING: --> " + audioMetaData.getEncoding());
            System.out.println("SAMPLE RATE: --> " + audioMetaData.getSampleRate());
            System.out.println("CHANNELS: --> " + audioMetaData.getChannels());
            System.out.println("LENGTH: --> " + audioMetaData.getLength());
        }
        if (data instanceof AudioData) {
            AudioData audioData = (AudioData) data;
            System.out.println("DATA: --> " + audioData.getData());
            System.out.println("TIMESTAMP: --> " + audioData.getTimestamp());
            System.out.println("IS SILENT: --> " + audioData.isSilent());
        }
        if (data instanceof TranscriptionMetadata) {
            TranscriptionMetadata transcriptionMetadata = (TranscriptionMetadata) data;
            System.out.println("TRANSCRIPTION SUBSCRIPTION ID: --> " + transcriptionMetadata.getTranscriptionSubscriptionId());
            System.out.println("LOCALE: --> " + transcriptionMetadata.getLocale());
            System.out.println("CALL CONNECTION ID: --> " + transcriptionMetadata.getCallConnectionId());
            System.out.println("CORRELATION ID: --> " + transcriptionMetadata.getCorrelationId());
        }
        if (data instanceof TranscriptionData) {
            TranscriptionData transcriptionData = (TranscriptionData) data;
            System.out.println("TEXT: --> " + transcriptionData.getText());
            System.out.println("FORMAT: --> " + transcriptionData.getFormat());
            System.out.println("CONFIDENCE: --> " + transcriptionData.getConfidence());
            System.out.println("OFFSET: --> " + transcriptionData.getOffset());
            System.out.println("DURATION: --> " + transcriptionData.getDuration());
            System.out.println("RESULT STATUS: --> " + transcriptionData.getResultStatus());
            for (Word word : transcriptionData.getWords()) {
                System.out.println("Text: --> " + word.getText());
                System.out.println("Offset: --> " + word.getOffset());
                System.out.println("Duration: --> " + word.getDuration());
            }
        }
    }
}
Update Transcription
For situations where your application allows users to select their preferred language you may also want to capture the transcription in that language. To do this, Call Automation SDK allows you to update the transcription locale.
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
// Option 1: Stop transcription with options
StopTranscriptionOptions stopTranscriptionOptions = new StopTranscriptionOptions();
client.getCallConnection(callConnectionId)
    .getCallMedia()
    .stopTranscriptionWithResponse(stopTranscriptionOptions, Context.NONE);
// Alternative: Stop transcription without options
// client.getCallConnection(callConnectionId)
// .getCallMedia()
// .stopTranscription();
Create a call and provide the transcription details (JavaScript)
Define the TranscriptionOptions so that Azure Communication Services knows whether to start transcription immediately or later, which locale to transcribe in, and which WebSocket connection to use for sending the transcript.
const transcriptionOptions = {
    transportUrl: "",
    transportType: "websocket",
    locale: "en-US",
    startTranscription: false
};
const options = {
    callIntelligenceOptions: {
        cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
    },
    transcriptionOptions: transcriptionOptions
};
console.log("Placing outbound call...");
acsClient.createCall(callInvite, process.env.CALLBACK_URI + "/api/callbacks", options);
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
const startTranscriptionOptions = {
    locale: "en-AU",
    operationContext: "startTranscriptionContext"
};
// Start transcription with options
await callMedia.startTranscription(startTranscriptionOptions);
// Alternative: Start transcription without options
// await callMedia.startTranscription();
Receiving Transcription Stream
When transcription starts, your WebSocket receives the transcription metadata payload as the first packet. This payload carries the call metadata and the locale for the configuration.
{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "835be116-f750-48a4-a5a4-ab85e070e5b0",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}
Receiving Transcription Data
After the metadata, the next packets your WebSocket receives are TranscriptionData for the transcribed audio.
{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}
Handling transcription stream in the web socket server
import WebSocket from 'ws';
import { streamingData } from '@azure/communication-call-automation/src/util/streamingDataParser';
const wss = new WebSocket.Server({ port: 8081 });
wss.on('connection', (ws) => {
    console.log('Client connected');
    ws.on('message', (packetData) => {
        const decoder = new TextDecoder();
        const stringJson = decoder.decode(packetData);
        console.log("STRING JSON => " + stringJson);
        const response = streamingData(packetData);
        if ('locale' in response) {
            console.log("Transcription Metadata");
        }
        if ('text' in response) {
            console.log("Transcription Data");
            if ('phoneNumber' in response.participant) {
                console.log(response.participant.phoneNumber);
            }
            response.words.forEach((word) => {
                console.log(word.text);
            });
        }
    });
    ws.on('close', () => {
        console.log('Client disconnected');
    });
});
console.log('WebSocket server running on port 8081');
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
await callMedia.updateTranscription("en-GB");
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
const stopTranscriptionOptions = {
    operationContext: "stopTranscriptionContext"
};
// Stop transcription with options
await callMedia.stopTranscription(stopTranscriptionOptions);
// Alternative: Stop transcription without options
// await callMedia.stopTranscription();
Create a call and provide the transcription details (Python)
Define the TranscriptionOptions so that Azure Communication Services knows whether to start transcription immediately or later, which locale to transcribe in, and which WebSocket connection to use for sending the transcript.
transcription_options = TranscriptionOptions(
transport_url=" ",
call_connection_properties = call_automation_client.create_call(
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
# Start transcription without options
call_connection_client.start_transcription()
# Option 1: Start transcription with locale and operation context
# call_connection_client.start_transcription(locale="en-AU", operation_context="startTranscriptionContext")
# Option 2: Start transcription with operation context
# call_connection_client.start_transcription(operation_context="startTranscriptionContext")
Receiving Transcription Stream
When transcription starts, your WebSocket receives the transcription metadata payload as the first packet. This payload carries the call metadata and the locale for the configuration.
{
    "kind": "TranscriptionMetadata",
    "transcriptionMetadata": {
        "subscriptionId": "835be116-f750-48a4-a5a4-ab85e070e5b0",
        "locale": "en-us",
        "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
        "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
    }
}
Receiving Transcription Data
After the metadata, the next packets your WebSocket receives are TranscriptionData for the transcribed audio.
{
    "kind": "TranscriptionData",
    "transcriptionData": {
        "text": "Testing transcription.",
        "format": "display",
        "confidence": 0.695223331451416,
        "offset": 2516998782481234400,
        "words": [
            {
                "text": "testing",
                "offset": 2516998782481234400
            },
            {
                "text": "testing",
                "offset": 2516998782481234400
            }
        ],
        "participantRawID": "8:acs:",
        "resultStatus": "Final"
    }
}
Handling transcription stream in the web socket server
import asyncio
import json
import websockets
from azure.communication.callautomation._shared.models import identifier_from_raw_id
async def handle_client(websocket, path):
    print("Client connected")
    try:
        async for message in websocket:
            json_object = json.loads(message)
            kind = json_object['kind']
            if kind == 'TranscriptionMetadata':
                print("Transcription metadata")
                print("Subscription ID:", json_object['transcriptionMetadata']['subscriptionId'])
                print("Locale:", json_object['transcriptionMetadata']['locale'])
                print("Call Connection ID:", json_object['transcriptionMetadata']['callConnectionId'])
                print("Correlation ID:", json_object['transcriptionMetadata']['correlationId'])
            if kind == 'TranscriptionData':
                participant = identifier_from_raw_id(json_object['transcriptionData']['participantRawID'])
                word_data_list = json_object['transcriptionData']['words']
                print("Transcription data")
                print("Text:", json_object['transcriptionData']['text'])
                print("Format:", json_object['transcriptionData']['format'])
                print("Confidence:", json_object['transcriptionData']['confidence'])
                print("Offset:", json_object['transcriptionData']['offset'])
                print("Duration:", json_object['transcriptionData']['duration'])
                print("Participant:", participant.raw_id)
                print("Result Status:", json_object['transcriptionData']['resultStatus'])
                for word in word_data_list:
                    print("Word:", word['text'])
                    print("Offset:", word['offset'])
                    print("Duration:", word['duration'])
    except websockets.exceptions.ConnectionClosedOK:
        print("Client disconnected")
    except websockets.exceptions.ConnectionClosedError as e:
        print(f"Connection closed with error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
start_server = websockets.serve(handle_client, "localhost", 8081)
print('WebSocket server running on port 8081')
# Start the server and keep the event loop running to accept connections.
asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()
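If you prefer typed objects over nested dictionary lookups, the two packet shapes shown earlier can be parsed with the standard library alone. The following is a minimal sketch under that assumption: the `Metadata` and `Transcript` dataclasses and the `parse_packet` helper are hypothetical names for this example and are not part of the Call Automation SDK. The field names come from the sample payloads, and treating any status other than "Final" as not final is also an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class Metadata:
    """Fields of the first packet (hypothetical container for this sketch)."""
    subscription_id: str
    locale: str
    call_connection_id: str
    correlation_id: str


@dataclass(frozen=True)
class Transcript:
    """Selected fields of a TranscriptionData packet (hypothetical container)."""
    text: str
    confidence: float
    words: List[str]
    is_final: bool


def parse_packet(packet: str) -> Optional[Union[Metadata, Transcript]]:
    """Turn one JSON packet into a typed object; return None for other kinds."""
    message = json.loads(packet)
    kind = message.get("kind")
    if kind == "TranscriptionMetadata":
        body = message["transcriptionMetadata"]
        return Metadata(
            subscription_id=body["subscriptionId"],
            locale=body["locale"],
            call_connection_id=body["callConnectionId"],
            correlation_id=body["correlationId"],
        )
    if kind == "TranscriptionData":
        body = message["transcriptionData"]
        return Transcript(
            text=body["text"],
            confidence=body["confidence"],
            words=[word["text"] for word in body.get("words", [])],
            # Assumption: only the value "Final" marks a completed result.
            is_final=body.get("resultStatus") == "Final",
        )
    return None


# Sample packets copied from the payloads shown in this article.
SAMPLE_METADATA = """
{"kind": "TranscriptionMetadata",
 "transcriptionMetadata": {
   "subscriptionId": "835be116-f750-48a4-a5a4-ab85e070e5b0",
   "locale": "en-us",
   "callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
   "correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"}}
"""

SAMPLE_DATA = """
{"kind": "TranscriptionData",
 "transcriptionData": {
   "text": "Testing transcription.",
   "format": "display",
   "confidence": 0.695223331451416,
   "offset": 2516998782481234400,
   "words": [{"text": "testing", "offset": 2516998782481234400},
             {"text": "testing", "offset": 2516998782481234400}],
   "participantRawID": "8:acs:",
   "resultStatus": "Final"}}
"""

if __name__ == "__main__":
    print(parse_packet(SAMPLE_METADATA))
    print(parse_packet(SAMPLE_DATA))
```

Inside a handler such as `handle_client`, each incoming message could be passed to `parse_packet`, and the application could act only on results where `is_final` is true.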
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
await call_connection_client.update_transcription(locale="en-GB")
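The event codes in this article include a failure (subcode 8523) for locales that lack region information, so it can be worth checking a user-selected value before sending the update. The sketch below shows one possible check. The `has_region` helper and its pattern are assumptions made for this example: the pattern only covers simple language-region tags such as `en-US`, and the service's own validation may accept or reject other forms.

```python
import re

# Simplified pattern (an assumption): 2-3 letter language, hyphen, 2 letter region.
_LANGUAGE_REGION = re.compile(r"[A-Za-z]{2,3}-[A-Za-z]{2}")


def has_region(locale: str) -> bool:
    """Return True when locale looks like a language-region tag such as en-US."""
    return _LANGUAGE_REGION.fullmatch(locale) is not None


if __name__ == "__main__":
    for candidate in ("en-US", "en-us", "en"):
        print(candidate, has_region(candidate))
```

A bare language code such as `en`, or a longer identifier such as a voice name, does not match this pattern, so the application can ask the user to pick again instead of waiting for a TranscriptionFailed event.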
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
# Stop transcription without options
call_connection_client.stop_transcription()
# Alternative: Stop transcription with operation context
# call_connection_client.stop_transcription(operation_context="stopTranscriptionContext")
Event codes
Event | Code | Subcode | Message |
TranscriptionStarted | 200 | 0 | Action completed successfully. |
TranscriptionStopped | 200 | 0 | Action completed successfully. |
TranscriptionUpdated | 200 | 0 | Action completed successfully. |
TranscriptionFailed | 400 | 8581 | Action failed, StreamUrl isn't valid. |
TranscriptionFailed | 400 | 8565 | Action failed due to a bad request to Cognitive Services. Check your input parameters. |
TranscriptionFailed | 400 | 8565 | Action failed due to a request to Cognitive Services timing out. Try again later or check for any issues with the service. |
TranscriptionFailed | 400 | 8605 | Custom speech recognition model for Transcription is not supported. |
TranscriptionFailed | 400 | 8523 | Invalid Request, locale is missing. |
TranscriptionFailed | 400 | 8523 | Invalid Request, only locale that contain region information are supported. |
TranscriptionFailed | 405 | 8520 | Transcription functionality is not supported at this time. |
TranscriptionFailed | 405 | 8520 | UpdateTranscription is not supported for connection created with Connect interface. |
TranscriptionFailed | 400 | 8528 | Action is invalid, call already terminated. |
TranscriptionFailed | 405 | 8520 | Update transcription functionality is not supported at this time. |
TranscriptionFailed | 405 | 8522 | Request not allowed when Transcription url not set during call setup. |
TranscriptionFailed | 405 | 8522 | Request not allowed when Cognitive Service Configuration not set during call setup. |
TranscriptionFailed | 400 | 8501 | Action is invalid when call is not in Established state. |
TranscriptionFailed | 401 | 8565 | Action failed due to a Cognitive Services authentication error. Check your authorization input and ensure it's correct. |
TranscriptionFailed | 403 | 8565 | Action failed due to a forbidden request to Cognitive Services. Check your subscription status and ensure it's active. |
TranscriptionFailed | 429 | 8565 | Action failed, requests exceeded the number of allowed concurrent requests for the cognitive services subscription. |
TranscriptionFailed | 500 | 8578 | Action failed, not able to establish WebSocket connection. |
TranscriptionFailed | 500 | 8580 | Action failed, transcription service was shut down. |
TranscriptionFailed | 500 | 8579 | Action failed, transcription was canceled. |
TranscriptionFailed | 500 | 9999 | Unknown internal server error. |
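One way an application might act on these codes is to group failures by status code before deciding how to respond. The Python sketch below is a heuristic under that assumption, not service guidance: the `classify_failure` helper and its category names are hypothetical, and only the code values come from the table above.

```python
def classify_failure(code: int) -> str:
    """Map a TranscriptionFailed status code to a coarse, app-defined category."""
    if code == 429:
        return "throttled"        # too many concurrent requests
    if code in (401, 403):
        return "credentials"      # authentication or subscription problem
    if code == 405:
        return "not-allowed"      # operation not supported or not configured
    if 400 <= code < 500:
        return "bad-request"      # invalid input or call state
    if 500 <= code < 600:
        return "service-error"    # connection or service-side failure
    return "unknown"


if __name__ == "__main__":
    for status in (400, 401, 403, 405, 429, 500):
        print(status, classify_failure(status))
```

An application could, for example, log "bad-request" and "not-allowed" failures for a developer to fix, and surface "throttled" or "service-error" failures to an operator.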
Known issues
- For 1:1 calls with Azure Communication Services users using the Client SDKs, startTranscription = True isn't currently supported.