Speech-to-Text API, where can I find timestamps for DisplayText

Chulin Liang
2020-10-26T15:43:21.693+00:00

Hello,

We are building a UI where the timestamp and confidence for each word will be displayed alongside the transcript. My only question: is there any way we can find the timestamps for the normalized text (after ITN, capitalization, punctuation detection, etc.)?

For example,

the lexical text is "nineteen eighty four" and we have the timestamps and confidences for each word, but what we want to show in the UI is the single token "1984" with a single timestamp; perhaps the confidence in that case can just be the average.

Of course it isn't a big task to re-do the normalization ourselves on those lexical words and match their timestamps to the DisplayText. For that number example, we could show the timestamp of "nineteen" for "1984" (see the sketch below). But it would be nice if we could get that information from the API instead of spending time reverse-engineering the process, which could be error-prone.
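To make that manual approach concrete, here is a minimal sketch (illustrative only: the span boundaries are hardcoded and the tick values are made up; real Offset/Duration values come from the detailed recognition result, in 100-nanosecond ticks) of collapsing a run of lexical words into one display token:

    def merge_span(words, start, end, display):
        """Collapse words[start:end] into a single display token."""
        span = words[start:end]
        return {
            "Word": display,
            # Use the first word's offset as the token's timestamp.
            "Offset": span[0]["Offset"],
            # Duration runs from the first word's start to the last word's end.
            "Duration": span[-1]["Offset"] + span[-1]["Duration"] - span[0]["Offset"],
            # Average the per-word confidences, as suggested above.
            "Confidence": sum(w["Confidence"] for w in span) / len(span),
        }

    words = [
        {"Word": "nineteen", "Offset": 5000000,  "Duration": 4000000, "Confidence": 0.95},
        {"Word": "eighty",   "Offset": 9500000,  "Duration": 3500000, "Confidence": 0.92},
        {"Word": "four",     "Offset": 13500000, "Duration": 3000000, "Confidence": 0.97},
    ]
    print(merge_span(words, 0, 3, "1984"))
    # {'Word': '1984', 'Offset': 5000000, 'Duration': 11500000, 'Confidence': 0.946...}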

Hope that makes sense, and thanks for reading my question. Big thanks to you all.

Chulin


1 answer

  1. YutongTie-MSFT
    2020-10-27T06:29:49.457+00:00

    Hello,

    Thanks for reaching out to us. Could you please try the following? Let me know if you have more questions.

    Please set:

    speech_config.request_word_level_timestamps()
    

    in the speech config of the Azure SDK; this lets you get the transcript along with a timestamp for each word.

    speech_config.output_format = speechsdk.OutputFormat.Detailed  # i.e. OutputFormat(1)
    

    This statement allows you to get the detailed JSON object from the Azure SDK, rather than the simple result.
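    For context, the detailed JSON carried in evt.result.json has roughly this shape (abridged and hand-written here for illustration; Offset and Duration are in 100-nanosecond ticks, and the exact set of fields can vary by service version):

    {
      "RecognitionStatus": "Success",
      "Offset": 1300000,
      "Duration": 22500000,
      "DisplayText": "It was 1984.",
      "NBest": [
        {
          "Confidence": 0.95,
          "Lexical": "it was nineteen eighty four",
          "ITN": "it was 1984",
          "MaskedITN": "it was 1984",
          "Display": "It was 1984.",
          "Words": [
            { "Word": "it",  "Offset": 1300000, "Duration": 2000000 },
            { "Word": "was", "Offset": 3400000, "Duration": 2500000 }
          ]
        }
      ]
    }

    Note that the Words array is aligned with the Lexical form, not with DisplayText, which is why the mapping you describe currently has to be done on the client side.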

    Below is some sample code. Make sure you replace the keys. Some error handling might be needed where speech-to-text could fail.

    import json
    import logging
    import time

    import azure.cognitiveservices.speech as speechsdk

    logger = logging.getLogger(__name__)

    def process():
        logger.debug("Speech to text request received")

        audio_filepath = <PATH_TO_AUDIO_FILE>
        locale = "en-US"  # Change as per requirement

        logger.debug(audio_filepath)
        audio_config = speechsdk.audio.AudioConfig(filename=audio_filepath)
        speech_config = speechsdk.SpeechConfig(subscription=<SUBSCRIPTION_KEY>, region=<SERVICE_REGION>)
        speech_config.request_word_level_timestamps()
        speech_config.speech_recognition_language = locale
        speech_config.output_format = speechsdk.OutputFormat.Detailed

        # Creates a recognizer with the given settings
        speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

        # Variable to monitor status
        done = False

        # Accumulators filled by the recognition callback
        transcript_display_list = []
        transcript_ITN_list = []
        confidence_list = []
        words = []

        # Service callback for recognized text: parse the detailed JSON payload
        # and keep the highest-confidence NBest hypothesis
        def parse_azure_result(evt):
            response = json.loads(evt.result.json)
            transcript_display_list.append(response['DisplayText'])
            confidence_list_temp = [item.get('Confidence') for item in response['NBest']]
            max_confidence_index = confidence_list_temp.index(max(confidence_list_temp))
            confidence_list.append(response['NBest'][max_confidence_index]['Confidence'])
            transcript_ITN_list.append(response['NBest'][max_confidence_index]['ITN'])
            words.extend(response['NBest'][max_confidence_index]['Words'])
            logger.debug(evt)

        # Service callback that stops continuous recognition upon receiving an event `evt`
        def stop_cb(evt):
            print('CLOSING on {}'.format(evt))
            speech_recognizer.stop_continuous_recognition()
            nonlocal done
            done = True

            # Do something with the combined responses
            print(transcript_display_list)
            print(confidence_list)
            print(words)

        # Connect callbacks to the events fired by the speech recognizer
        speech_recognizer.recognizing.connect(lambda evt: logger.debug('RECOGNIZING: {}'.format(evt)))
        speech_recognizer.recognized.connect(parse_azure_result)
        speech_recognizer.session_started.connect(lambda evt: logger.debug('SESSION STARTED: {}'.format(evt)))
        speech_recognizer.session_stopped.connect(lambda evt: logger.debug('SESSION STOPPED {}'.format(evt)))
        speech_recognizer.canceled.connect(lambda evt: logger.debug('CANCELED {}'.format(evt)))
        # Stop continuous recognition on either session stopped or canceled events
        speech_recognizer.session_stopped.connect(stop_cb)
        speech_recognizer.canceled.connect(stop_cb)

        # Start continuous speech recognition
        logger.debug("Initiating speech to text")
        speech_recognizer.start_continuous_recognition()
        while not done:
            time.sleep(.5)
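    As a follow-up to the sample above, here is a small sketch (not part of the SDK; the only assumption is the 100-nanosecond tick unit for Offset and Duration) of turning the collected word entries into UI-friendly seconds:

    TICKS_PER_SECOND = 10_000_000  # Offset/Duration are in 100-nanosecond ticks

    def to_display_row(word):
        """Format one word entry as 'word  start - end' in seconds."""
        start = word["Offset"] / TICKS_PER_SECOND
        end = (word["Offset"] + word["Duration"]) / TICKS_PER_SECOND
        return f'{word["Word"]:<12} {start:6.2f}s - {end:6.2f}s'

    # Example with the kind of entries parse_azure_result() collects:
    for w in [{"Word": "nineteen", "Offset": 5000000, "Duration": 4000000}]:
        print(to_display_row(w))  # nineteen       0.50s -   0.90s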
    
