Transcriptions - Transcribe

Transcribes the provided audio stream.

POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview

URI Parameters

Name         In        Required  Type           Description
audio        formData  True      file (binary)  The audio as a stream of bytes.
definition   formData  True      string         Metadata for a fast transcription request. This field contains a JSON-serialized object of type TranscribeDefinition.
endpoint     path      True      string         Supported Cognitive Services endpoints (protocol and hostname, for example: https://westus.api.cognitive.microsoft.com).
api-version  query     True      string         The requested API version.
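
The parameters above can be assembled into a multipart/form-data request. The following is a minimal sketch using only the Python standard library; it builds the request without sending it. The helper name build_transcribe_request, the file name audio.wav, the per-part content types, and the "locales" key in the definition are illustrative assumptions: this section does not list the fields of TranscribeDefinition.

```python
import json
import uuid
import urllib.request


def build_transcribe_request(endpoint, api_version, audio_bytes, definition,
                             audio_filename="audio.wav"):
    """Build (but do not send) a multipart POST for the transcribe operation."""
    url = (f"{endpoint.rstrip('/')}/speechtotext/transcriptions:transcribe"
           f"?api-version={api_version}")
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")

    parts = [
        # "audio" form field: the raw audio bytes.
        delimiter,
        (f'Content-Disposition: form-data; name="audio"; '
         f'filename="{audio_filename}"').encode("utf-8"),
        b"Content-Type: application/octet-stream",  # assumed content type
        b"",
        audio_bytes,
        # "definition" form field: a JSON-serialized TranscribeDefinition.
        delimiter,
        b'Content-Disposition: form-data; name="definition"',
        b"Content-Type: application/json",  # assumed content type
        b"",
        json.dumps(definition).encode("utf-8"),
        # Closing delimiter.
        delimiter + b"--",
        b"",
    ]
    body = b"\r\n".join(parts)
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical definition; the "locales" key is an assumption, not taken from this page.
request = build_transcribe_request(
    "https://westus.api.cognitive.microsoft.com",
    "2024-05-15-preview",
    b"<audio bytes here>",
    {"locales": ["en-US"]},
)
print(request.get_method(), request.full_url)
```

An authentication header (see Security) still has to be added before the request is sent, for example with urllib.request.urlopen.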

Responses

Name    Type              Description
200 OK  TranscribeResult  OK

Security

Ocp-Apim-Subscription-Key

Provide your Cognitive Services account key here.

Type: apiKey
In: header

Authorization

Provide an access token from the JWT returned by the STS of this region. Make sure to add the management scope to the token by adding the following query string to the STS URL: ?scope=speechservicesmanagement

Type: apiKey
In: header
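
As a rough illustration of the two schemes above, the sketch below returns the header for one of them. It assumes a request carries either the subscription key or the access token, not both, and that the token is sent with the "Bearer" prefix; neither detail is stated in this section, and the helper name auth_headers is hypothetical.

```python
def auth_headers(subscription_key=None, access_token=None):
    """Return the HTTP header for exactly one of the two security schemes."""
    if (subscription_key is None) == (access_token is None):
        raise ValueError("Provide exactly one of subscription_key or access_token")
    if subscription_key is not None:
        # Scheme 1: the account key goes in Ocp-Apim-Subscription-Key.
        return {"Ocp-Apim-Subscription-Key": subscription_key}
    # Scheme 2: the STS-issued token goes in Authorization.
    # Assumption: the "Bearer " prefix is expected by the service.
    return {"Authorization": f"Bearer {access_token}"}


print(auth_headers(subscription_key="YOUR_KEY"))
```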

Examples

Transcribe an audio file

Sample request

POST {endpoint}/speechtotext/transcriptions:transcribe?api-version=2024-05-15-preview

Sample response

{
  "duration": 2000,
  "combinedPhrases": [
    {
      "text": "Weather"
    }
  ],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [
        {
          "text": "Weather",
          "offset": 40,
          "duration": 240
        }
      ],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
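
A response like the one above might be consumed as follows. This sketch parses the sample JSON and converts each phrase's offset and duration, which the definitions below give in milliseconds, into start and end times in seconds. The function name phrase_timings is illustrative.

```python
import json


def phrase_timings(result):
    """Return (start_seconds, end_seconds, text) for each phrase in a result."""
    rows = []
    for phrase in result.get("phrases", []):
        start_ms = phrase["offset"]
        end_ms = start_ms + phrase["duration"]
        rows.append((start_ms / 1000, end_ms / 1000, phrase["text"]))
    return rows


# The sample response from this page.
sample = json.loads("""
{
  "duration": 2000,
  "combinedPhrases": [{"text": "Weather"}],
  "phrases": [
    {
      "offset": 40,
      "duration": 240,
      "text": "Weather",
      "words": [{"text": "Weather", "offset": 40, "duration": 240}],
      "locale": "en-US",
      "confidence": 0.7881154
    }
  ]
}
""")

for start, end, text in phrase_timings(sample):
    print(f"{start:.2f}s to {end:.2f}s: {text}")  # prints "0.04s to 0.28s: Weather"
```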

Definitions

Name              Description
CombinedPhrases
Phrase            A transcribed phrase.
TranscribeResult  The result of the transcribe operation.
Word              Time-stamped word in the display form.

CombinedPhrases

Name     Type     Description
channel  integer  The 0-based channel index. Only present if channel separation is enabled.
text     string   The complete transcribed text for the channel.
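
Because channel is only present when channel separation is enabled, code reading combinedPhrases has to tolerate its absence. A small sketch, with the hypothetical helper text_by_channel mapping each channel index to its text and using None as the key when no channel is reported:

```python
def text_by_channel(result):
    """Map channel index (or None if absent) to the combined text."""
    return {
        entry.get("channel"): entry["text"]
        for entry in result.get("combinedPhrases", [])
    }


# Without channel separation, as in the sample response on this page.
print(text_by_channel({"combinedPhrases": [{"text": "Weather"}]}))
# With channel separation (illustrative values, not from the service).
print(text_by_channel({"combinedPhrases": [
    {"channel": 0, "text": "Hello"},
    {"channel": 1, "text": "Hi there"},
]}))
```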

Phrase

A transcribed phrase.

Name        Type     Description
channel     integer  The 0-based channel index. Only present if channel separation is enabled.
confidence  number   The confidence value for the phrase.
duration    integer  The duration of the phrase in milliseconds.
locale      string   The locale of the phrase.
offset      integer  The start offset of the phrase in milliseconds.
speaker     integer  The speaker number. Only present if speaker diarization is enabled.
text        string   The transcribed text of the phrase.
words       Word[]   The words that make up the phrase. Only present if word-level timestamps are enabled.

TranscribeResult

The result of the transcribe operation.

Name             Type               Description
combinedPhrases  CombinedPhrases[]  The combined transcription results for each channel.
duration         integer            The duration of the audio in milliseconds.
phrases          Phrase[]           The transcription results segmented into phrases.

Word

Time-stamped word in the display form.

Name      Type     Description
duration  integer  The duration of the word in milliseconds.
offset    integer  The start offset of the word in milliseconds.
text      string   The recognized word, including punctuation.
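
The Phrase and Word definitions can be mirrored in client code. The sketch below is one possible typed model, not an official SDK type. It treats the fields documented as "only present if ..." (channel, speaker, words) as optional; treating text, offset, and duration as always present, and locale and confidence as optional, is an assumption, since this page does not mark required fields.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    text: str       # the recognized word, including punctuation
    offset: int     # start offset in milliseconds
    duration: int   # duration in milliseconds


@dataclass
class Phrase:
    text: str
    offset: int     # start offset in milliseconds
    duration: int   # duration in milliseconds
    locale: Optional[str] = None
    confidence: Optional[float] = None
    channel: Optional[int] = None   # only with channel separation
    speaker: Optional[int] = None   # only with speaker diarization
    words: List[Word] = field(default_factory=list)  # only with word timestamps


def parse_phrase(raw: dict) -> Phrase:
    """Convert one decoded JSON phrase object into a Phrase."""
    return Phrase(
        text=raw["text"],
        offset=raw["offset"],
        duration=raw["duration"],
        locale=raw.get("locale"),
        confidence=raw.get("confidence"),
        channel=raw.get("channel"),
        speaker=raw.get("speaker"),
        words=[
            Word(text=w["text"], offset=w["offset"], duration=w["duration"])
            for w in raw.get("words", [])
        ],
    )


phrase = parse_phrase({
    "offset": 40, "duration": 240, "text": "Weather",
    "words": [{"text": "Weather", "offset": 40, "duration": 240}],
    "locale": "en-US", "confidence": 0.7881154,
})
print(phrase.text, phrase.speaker, len(phrase.words))  # prints "Weather None 1"
```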