How to synchronize real world events happening while speech recognition is happening with individual spoken words

Mark Miller (DevExpress) 0

I am trying to synchronize real world events that are occuring during live streaming of speech to Azure speech recognition services (e.g., eye gaze shifts, hardware device interactions, etc.). I note the time when I start speech recognition and record the time offsets and additional data for each of these external events. Later, when I get the speech data back I'm noticing that the individual word time offsets seem to out of sync with the offset timings I've been tracking for the actual in-room events. I'm finding it challenging to reliably synchronize these events. From what moment in the streamed speech are the word offsets calculated? The beginning of the live stream even if it starts with silence? Or the beginning of the first sound that exceeds a certain threshold in volume (it seems like the latter from my observations, and the AI-generated answer seems to confirm this). Any information that could help me reliable synchronize these real world events with the individual words spoken would be appreciated. If there is a volume threshold (marking the start of speech recognition for Azure cognitive services) that I need to check, what is that value?

YutongTie-MSFT 48,001 Reputation points

2024-07-02T00:08:30.7033333+00:00
Hello Mark,

Thanks for reaching out to us, there is no public volume threshold to the public as far as any document reference.

If you have specific case you can share with us, we can discuss further or we can help you connect with a support engineer to provide more detailed solution for your scenario.

As the documents we have and answer the question generally -

Azure Speech Services typically starts calculating word offsets from the beginning of recognizable speech. This means it starts from the moment it detects speech that exceeds a certain volume threshold or meets other criteria indicating the start of meaningful audio input.

It does not start from the absolute beginning of the live stream, which may include initial silence or ambient noise.

While Azure Speech Services doesn't publicly disclose the exact volume threshold or criteria for starting recognition, it generally starts processing when it detects meaningful speech sounds. This could involve analyzing audio intensity and characteristics to determine when speech begins.

For better result of a live stream recognition, there are some items you may want to considerate -

-

Ensure that you accurately timestamp your real-world events (e.g., eye gaze shifts, hardware interactions) relative to a common starting point. This starting point should ideally be aligned with when meaningful speech begins as detected by your system.

Use a high-precision clock or timestamping mechanism synchronized across all devices and processes involved in recording these events.

When you receive word-level offsets from Azure Speech Services, adjust them relative to your starting timestamp for speech recognition. This adjustment should account for any delay from the start of recognizable speech to the actual start of processing.

You may need to experiment and calibrate by comparing the word offsets provided by Azure with your recorded events to fine-tune this synchronization.

Be aware that initial silence or non-speech sounds at the beginning of the live stream may not be included in the word offsets provided by Azure. Your synchronization logic should accommodate this by ensuring that real-world events and speech recognition offsets are aligned based on meaningful speech onset.

Please let us know if you have any question, I hope this helps!

Regards,

Yutong
Mark Miller (DevExpress) 0 Reputation points

2024-07-02T13:30:45.2+00:00

Thanks for your reply.

Azure Speech Services typically starts calculating word offsets from the beginning of recognizable speech. This means it starts from the moment it detects speech that exceeds a certain volume threshold or meets other criteria indicating the start of meaningful audio input.

Without knowing the algorithm or threshold used by Azure Speech Services to detect this offset, there seems to be no way to synchronize external in-room events reliably. What is the best way to feature request the team include this TimeSpan offset (from the beginning of the stream to the moment voice was first detected) in the results returned in the WordsRecognized event?

1 answer

Deleted

This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.

Comments have been turned off. Learn more

Share via

How to synchronize real world events happening while speech recognition is happening with individual spoken words

1 answer