Is there a way to get begin time and end time for the conversion result of stream audio?

Question

I am using azure-speech to recognize audio stream, from speech_recognition_samples.cpp, from class RecognitionResult I only can get the Text and m_duration, but how can I get the begin time and end time of the result in the speech?
I use azure-speech in this way : write audio stream to AudioInputStream, and get result from SpeechRecognizer

void SpeechContinuousRecognitionWithPushStream()
{
    auto config = SpeechConfig::FromSubscription("YourSubscriptionKey", "YourServiceRegion");

    auto pushStream = AudioInputStream::CreatePushStream();

    auto audioInput = AudioConfig::FromStreamInput(pushStream);
    auto recognizer = SpeechRecognizer::FromConfig(config, audioInput);
    promise recognitionEnd;

    recognizer->Recognizing.Connect([](const SpeechRecognitionEventArgs& e)
    {
        cout << "Recognizing:" << e.Result->Text << std::endl
               << "  Offset=" << e.Result->Offset() << std::endl
              << "  Duration=" << e.Result->Duration() << std::endl;
    });

    recognizer->Recognized.Connect([](const SpeechRecognitionEventArgs& e)
    {
        if (e.Result->Reason == ResultReason::RecognizedSpeech)
        {
            cout << "RECOGNIZED: Text=" << e.Result->Text << std::endl
                << "  Offset=" << e.Result->Offset() << std::endl
                << "  Duration=" << e.Result->Duration() << std::endl;
        }
        else if (e.Result->Reason == ResultReason::NoMatch)
        {
            cout << "NOMATCH: Speech could not be recognized." << std::endl;
        }
    });

    recognizer->Canceled.Connect([&recognitionEnd](const SpeechRecognitionCanceledEventArgs& e)
    {
        switch (e.Reason)
        {
        case CancellationReason::EndOfStream:
            cout << "CANCELED: Reach the end of the file." << std::endl;
            break;

        case CancellationReason::Error:
            cout << "CANCELED: ErrorCode=" << (int)e.ErrorCode << std::endl;
            cout << "CANCELED: ErrorDetails=" << e.ErrorDetails << std::endl;
            recognitionEnd.set_value();
            break;

        default:
            cout << "CANCELED: received unknown reason." << std::endl;
        }

    });

    recognizer->SessionStopped.Connect([&recognitionEnd](const SessionEventArgs& e)
    {
        cout << "Session stopped.";
        recognitionEnd.set_value(); // Notify to stop recognition.
    });

    WavFileReader reader("whatstheweatherlike.wav");

    vector buffer(1000);

    recognizer->StartContinuousRecognitionAsync().wait();

    int readSamples = 0;
    while((readSamples = reader.Read(buffer.data(), (uint32_t)buffer.size())) != 0)
    {
        // Push a buffer into the stream
        pushStream->Write(buffer.data(), readSamples);
    }

    // Close the push stream.
    pushStream->Close();

    // Waits for recognition end.
    recognitionEnd.get_future().get();

    // Stops recognition.
    recognizer->StopContinuousRecognitionAsync().get();
}

Answer

@klen There is an option to request word level timestamps by setting the same in your speech config settings. Similar, to setting the subscription key and region. This however does not explicitly give the begin and ending of a sentence but the offsets and duration of your sentence are mentioned. A sample output should look like below:

# {"Duration":13400000,"NBest":[{"Confidence":0.9761951565742493,"Display":"What's the weather like?","ITN":"What's the weather like","Lexical":"what's the weather like","MaskedITN":"What's the weather like","Words":[{"Duration":3800000,"Offset":600000,"Word":"what's"},{"Duration":1200000,"Offset":4500000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]},{"Confidence":0.9245584011077881,"Display":"what is the weather like","ITN":"what is the weather like","Lexical":"what is the weather like","MaskedITN":"what is the weather like","Words":[{"Duration":2900000,"Offset":600000,"Word":"what"},{"Duration":700000,"Offset":3600000,"Word":"is"},{"Duration":1300000,"Offset":4400000,"Word":"the"},{"Duration":2900000,"Offset":5800000,"Word":"weather"},{"Duration":4700000,"Offset":8800000,"Word":"like"}]}],"Offset":400000,"RecognitionStatus":"Success"}

Share via

Is there a way to get begin time and end time for the conversion result of stream audio?

1 answer