Transcriptions - Transcribe

Service: Azure AI Services

API Version: 2024-05-15-preview

Transcribes the provided audio stream.

POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview

Request Parameters

Name	In	Required	Type	Description
audio	formData	True	file binary	The audio as a stream of bytes.
definition	formData	True	string	Metadata for a fast transcription request. This field contains a JSON-serialized object of type `TranscribeDefinition`.
endpoint	path	True	string	Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com).
api-version	query	True	string	The requested API version.
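
The parameters above can be assembled into a request roughly as follows. This is a minimal sketch using only the Python standard library, assuming the two `formData` fields are sent as a `multipart/form-data` body and the key is passed in the `Ocp-Apim-Subscription-Key` header. The helper names (`build_transcribe_request`, `_part`), the default file name, and the per-part content types are illustrative choices, not part of the API.

```python
import json
import urllib.request
import uuid

API_VERSION = "2024-05-15-preview"


def _part(boundary, name, content, filename=None, content_type=None):
    """Encode one multipart/form-data part as bytes."""
    disposition = f'form-data; name="{name}"'
    if filename:
        disposition += f'; filename="{filename}"'
    lines = [f"--{boundary}", f"Content-Disposition: {disposition}"]
    if content_type:
        lines.append(f"Content-Type: {content_type}")
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("utf-8") + content + b"\r\n"


def build_transcribe_request(endpoint, subscription_key, audio_bytes,
                             definition, filename="audio.wav"):
    """Build (but do not send) the POST request for transcriptions:transcribe."""
    url = (f"{endpoint.rstrip('/')}/speechtotext/transcriptions:transcribe"
           f"?api-version={API_VERSION}")
    boundary = uuid.uuid4().hex
    body = b"".join([
        # `definition` is a JSON-serialized TranscribeDefinition object.
        _part(boundary, "definition", json.dumps(definition).encode("utf-8"),
              content_type="application/json"),
        # `audio` is the raw audio stream; the content type here is an assumption.
        _part(boundary, "audio", audio_bytes, filename=filename,
              content_type="application/octet-stream"),
        f"--{boundary}--\r\n".encode("ascii"),
    ])
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to `urllib.request.urlopen` and the response body decoded with `json.load`. The contents of `definition` depend on `TranscribeDefinition`, which is not documented on this page; a value such as `{"locales": ["en-US"]}` is an assumption about its shape.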

Responses

Name	Type	Description
200 OK	TranscribeResult	OK

Security

Ocp-Apim-Subscription-Key

Provide your Cognitive Services account key.

Type: apiKey
In: header

Authorization

Provide the access token (a JWT) returned by the security token service (STS) of this region. To add the management scope to the token, append the following query string to the STS URL: ?scope=speechservicesmanagement

Type: apiKey
In: header
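
The two security schemes above can be sketched as a pair of small helpers. This is an illustration only: the function names are hypothetical, the STS URL is whatever your region provides (none is given on this page), and the `Bearer ` prefix on the `Authorization` header is an assumption based on common practice rather than something this reference states.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_management_scope(sts_url):
    """Return the STS URL with ?scope=speechservicesmanagement added."""
    parts = urlsplit(sts_url)
    query = dict(parse_qsl(parts.query))
    query["scope"] = "speechservicesmanagement"
    return urlunsplit(parts._replace(query=urlencode(query)))


def auth_headers(subscription_key=None, access_token=None):
    """Build the header for exactly one of the two security schemes."""
    if (subscription_key is None) == (access_token is None):
        raise ValueError("Provide exactly one of subscription_key or access_token")
    if subscription_key is not None:
        return {"Ocp-Apim-Subscription-Key": subscription_key}
    # Assumption: the token is sent with the conventional "Bearer " prefix.
    return {"Authorization": f"Bearer {access_token}"}
```

Requiring exactly one credential keeps a request from carrying both headers at once, which makes it easier to tell which scheme the service actually accepted.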

Examples

Transcribe an audio file

Sample request

POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview

Sample response

Status code: 200

{
  "duration": 2000,
  "combinedPhrases": [
    {
      "text": "Weather"
    }
  ],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [
        {
          "text": "Weather",
          "offset": 40,
          "duration": 240
        }
      ],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
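
A response shaped like the sample above could be read as follows. This is a minimal sketch: the helper names `full_text` and `word_timings` are illustrative, and the embedded JSON is a trimmed copy of the sample response. The `words` array is treated as optional because it is only present when word-level timestamps are enabled.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "duration": 2000,
  "combinedPhrases": [{"text": "Weather"}],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [{"text": "Weather", "offset": 40, "duration": 240}],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
"""


def full_text(result):
    """Join the combined text of every channel into one string."""
    return " ".join(cp["text"] for cp in result.get("combinedPhrases", []))


def word_timings(result):
    """Yield (text, start_ms, end_ms) for each time-stamped word."""
    for phrase in result.get("phrases", []):
        for word in phrase.get("words", []):
            start = word["offset"]
            yield word["text"], start, start + word["duration"]


result = json.loads(SAMPLE)
print(full_text(result))
print(list(word_timings(result)))
```

Offsets and durations are in milliseconds, so the end time of a word is its `offset` plus its `duration`.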

Definitions

Name	Description
CombinedPhrases	The complete transcribed text for a channel.
Phrase	A transcribed phrase.
TranscribeResult	The result of the transcribe operation.
Word	Time-stamped word in the display form.

CombinedPhrases

Name	Type	Description
channel	integer	The 0-based channel index. Only present if channel separation is enabled.
text	string	The complete transcribed text for the channel.

Phrase

A transcribed phrase.

Name	Type	Description
channel	integer	The 0-based channel index. Only present if channel separation is enabled.
confidence	number	The confidence value for the phrase.
duration	integer	The duration of the phrase in milliseconds.
locale	string	The locale of the phrase.
offset	integer	The start offset of the phrase in milliseconds.
speaker	integer	The speaker number. Only present if speaker diarization is enabled.
text	string	The transcribed text of the phrase.
words	Word[]	The words that make up the phrase. Only present if word-level timestamps are enabled.

TranscribeResult

The result of the transcribe operation.

Name	Type	Description
combinedPhrases	CombinedPhrases[]	The combined transcription results for each channel.
duration	integer	The duration of the audio in milliseconds.
phrases	Phrase[]	The transcription results segmented into phrases.

Word

Time-stamped word in the display form.

Name	Type	Description
duration	integer	The duration of the word in milliseconds.
offset	integer	The start offset of the word in milliseconds.
text	string	The recognized word, including punctuation.
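
The definitions above map naturally onto typed records. The following is one possible modelling with Python dataclasses, not an official client: the class names mirror the definitions, while `from_dict` and the snake_case attribute `combined_phrases` are illustrative. Fields described as "only present if" a feature is enabled default to `None` or an empty list; treating `locale` and `confidence` as optional too is a defensive assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str
    offset: int    # milliseconds
    duration: int  # milliseconds


@dataclass
class Phrase:
    text: str
    offset: int    # milliseconds
    duration: int  # milliseconds
    locale: Optional[str] = None
    confidence: Optional[float] = None
    channel: Optional[int] = None   # only with channel separation
    speaker: Optional[int] = None   # only with speaker diarization
    words: List[Word] = field(default_factory=list)  # only with word timestamps


@dataclass
class CombinedPhrases:
    text: str
    channel: Optional[int] = None   # only with channel separation


@dataclass
class TranscribeResult:
    duration: int  # milliseconds
    combined_phrases: List[CombinedPhrases]
    phrases: List[Phrase]

    @classmethod
    def from_dict(cls, data):
        """Build a TranscribeResult from the decoded JSON response."""
        phrases = [
            Phrase(
                text=p["text"], offset=p["offset"], duration=p["duration"],
                locale=p.get("locale"), confidence=p.get("confidence"),
                channel=p.get("channel"), speaker=p.get("speaker"),
                words=[Word(w["text"], w["offset"], w["duration"])
                       for w in p.get("words", [])],
            )
            for p in data.get("phrases", [])
        ]
        combined = [CombinedPhrases(c["text"], c.get("channel"))
                    for c in data.get("combinedPhrases", [])]
        return cls(data["duration"], combined, phrases)
```

Reading each field explicitly, rather than unpacking the dictionary with `**`, means unknown fields added in a later API version are ignored instead of raising a `TypeError`.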
