Speech-to-Text API, where can I find timestamps for DisplayText

Chulin Liang
2020-10-26T15:43:21.693+00:00

Hello,

We are building a UI where the timestamp and confidence for each word will be displayed alongside the transcript. My only question: is there any way we can find the timestamps for the normalized text (after ITN, capitalization, punctuation detection, etc.)?

For example,

the lexical text is "nineteen eighty four" and we have the timestamps and confidences for each word, but what we want to show in the UI is the single token "1984" with a single timestamp; perhaps the confidence in that case can just be the average.

Of course it isn't a big task to re-do the normalization ourselves on those lexical words and match their timestamps to the DisplayText. For that number example, we could show the timestamp of "nineteen" for "1984" (see the sketch below). But it would be nice if we could get that information from the API instead of spending time reverse-engineering the process, which could be error-prone.
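To make that manual approach concrete, here is a minimal sketch (illustrative only: the span boundaries are hardcoded and the tick values are made up; real Offset/Duration values come from the detailed recognition result, in 100-nanosecond ticks) of collapsing a run of lexical words into one display token:

    def merge_span(words, start, end, display):
        """Collapse words[start:end] into a single display token."""
        span = words[start:end]
        return {
            "Word": display,
            # Use the first word's offset as the token's timestamp.
            "Offset": span[0]["Offset"],
            # Duration runs from the first word's start to the last word's end.
            "Duration": span[-1]["Offset"] + span[-1]["Duration"] - span[0]["Offset"],
            # Average the per-word confidences, as suggested above.
            "Confidence": sum(w["Confidence"] for w in span) / len(span),
        }

    words = [
        {"Word": "nineteen", "Offset": 5000000,  "Duration": 4000000, "Confidence": 0.95},
        {"Word": "eighty",   "Offset": 9500000,  "Duration": 3500000, "Confidence": 0.92},
        {"Word": "four",     "Offset": 13500000, "Duration": 3000000, "Confidence": 0.97},
    ]
    print(merge_span(words, 0, 3, "1984"))
    # {'Word': '1984', 'Offset': 5000000, 'Duration': 11500000, 'Confidence': 0.946...}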

Hope that makes sense, and thanks for reading my question. Big thanks to you all.

Chulin


1 answer

  1. YutongTie-MSFT
    2020-10-27T06:29:49.457+00:00

    Hello,

    Thanks for reaching out to us. Could you please try the following? Let me know if you have more questions.

    Please set:

    speech_config.request_word_level_timestamps()
    

    in the speech config of the Azure SDK; this lets you get the transcript along with a timestamp for each word.

    speech_config.output_format = speechsdk.OutputFormat.Detailed  # i.e. OutputFormat(1)
    

    This statement allows you to get the detailed JSON object from the Azure SDK, rather than the simple result.
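    For context, the detailed JSON carried in evt.result.json has roughly this shape (abridged and hand-written here for illustration; Offset and Duration are in 100-nanosecond ticks, and the exact set of fields can vary by service version):

    {
      "RecognitionStatus": "Success",
      "Offset": 1300000,
      "Duration": 22500000,
      "DisplayText": "It was 1984.",
      "NBest": [
        {
          "Confidence": 0.95,
          "Lexical": "it was nineteen eighty four",
          "ITN": "it was 1984",
          "MaskedITN": "it was 1984",
          "Display": "It was 1984.",
          "Words": [
            { "Word": "it",  "Offset": 1300000, "Duration": 2000000 },
            { "Word": "was", "Offset": 3400000, "Duration": 2500000 }
          ]
        }
      ]
    }

    Note that the Words array is aligned with the Lexical form, not with DisplayText, which is why the mapping you describe currently has to be done on the client side.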

    Below is some sample code. Make sure you replace the keys. Some error handling might be needed where speech-to-text could fail.

    import json
    import logging
    import time

    import azure.cognitiveservices.speech as speechsdk

    logger = logging.getLogger(__name__)

    def process():
        logger.debug("Speech to text request received")

        audio_filepath = <PATH_TO_AUDIO_FILE>
        locale = "en-US"  # Change as per requirement

        logger.debug(audio_filepath)
        audio_config = speechsdk.audio.AudioConfig(filename=audio_filepath)
        speech_config = speechsdk.SpeechConfig(subscription=<SUBSCRIPTION_KEY>, region=<SERVICE_REGION>)
        speech_config.request_word_level_timestamps()
        speech_config.speech_recognition_language = locale
        speech_config.output_format = speechsdk.OutputFormat.Detailed

        # Creates a recognizer with the given settings
        speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

        # Variable to monitor status
        done = False

        # Accumulators filled by the recognition callback
        transcript_display_list = []
        transcript_ITN_list = []
        confidence_list = []
        words = []

        # Service callback for recognized text: parse the detailed JSON payload
        # and keep the highest-confidence NBest hypothesis
        def parse_azure_result(evt):
            response = json.loads(evt.result.json)
            transcript_display_list.append(response['DisplayText'])
            confidence_list_temp = [item.get('Confidence') for item in response['NBest']]
            max_confidence_index = confidence_list_temp.index(max(confidence_list_temp))
            confidence_list.append(response['NBest'][max_confidence_index]['Confidence'])
            transcript_ITN_list.append(response['NBest'][max_confidence_index]['ITN'])
            words.extend(response['NBest'][max_confidence_index]['Words'])
            logger.debug(evt)

        # Service callback that stops continuous recognition upon receiving an event `evt`
        def stop_cb(evt):
            print('CLOSING on {}'.format(evt))
            speech_recognizer.stop_continuous_recognition()
            nonlocal done
            done = True

            # Do something with the combined responses
            print(transcript_display_list)
            print(confidence_list)
            print(words)

        # Connect callbacks to the events fired by the speech recognizer
        speech_recognizer.recognizing.connect(lambda evt: logger.debug('RECOGNIZING: {}'.format(evt)))
        speech_recognizer.recognized.connect(parse_azure_result)
        speech_recognizer.session_started.connect(lambda evt: logger.debug('SESSION STARTED: {}'.format(evt)))
        speech_recognizer.session_stopped.connect(lambda evt: logger.debug('SESSION STOPPED {}'.format(evt)))
        speech_recognizer.canceled.connect(lambda evt: logger.debug('CANCELED {}'.format(evt)))
        # Stop continuous recognition on either session stopped or canceled events
        speech_recognizer.session_stopped.connect(stop_cb)
        speech_recognizer.canceled.connect(stop_cb)

        # Start continuous speech recognition
        logger.debug("Initiating speech to text")
        speech_recognizer.start_continuous_recognition()
        while not done:
            time.sleep(.5)
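    As a follow-up to the sample above, here is a small sketch (not part of the SDK; the only assumption is the 100-nanosecond tick unit for Offset and Duration) of turning the collected word entries into UI-friendly seconds:

    TICKS_PER_SECOND = 10_000_000  # Offset/Duration are in 100-nanosecond ticks

    def to_display_row(word):
        """Format one word entry as 'word  start - end' in seconds."""
        start = word["Offset"] / TICKS_PER_SECOND
        end = (word["Offset"] + word["Duration"]) / TICKS_PER_SECOND
        return f'{word["Word"]:<12} {start:6.2f}s - {end:6.2f}s'

    # Example with the kind of entries parse_azure_result() collects:
    for w in [{"Word": "nineteen", "Offset": 5000000, "Duration": 4000000}]:
        print(to_display_row(w))  # nineteen       0.50s -   0.90s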
    
