How to synchronize real-world events that occur during live speech recognition with individual spoken words

Mark Miller (DevExpress) 0 Reputation points
2024-07-01T11:45:38.4166667+00:00

I am trying to synchronize real-world events that occur during live streaming of speech to the Azure speech recognition service (e.g., eye-gaze shifts, hardware device interactions, etc.). I note the time when I start speech recognition and record the time offsets and additional data for each of these external events. Later, when I get the speech data back, I notice that the individual word time offsets seem to be out of sync with the offset timings I have been tracking for the actual in-room events, and I am finding it challenging to synchronize them reliably.

From what moment in the streamed speech are the word offsets calculated? The beginning of the live stream, even if it starts with silence? Or the beginning of the first sound that exceeds a certain volume threshold? (It seems like the latter from my observations, and the AI-generated answer seems to confirm this.) If there is a volume threshold marking the start of speech recognition for Azure cognitive services that I need to account for, what is that value? Any information that could help me reliably synchronize these real-world events with the individual spoken words would be appreciated.
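For context, here is a simplified sketch of the kind of synchronization I am attempting, using the Python Speech SDK. It assumes word offsets are ticks (100-nanosecond units) measured from the start of the audio stream, which is what the SDK documentation describes; the key/region values are placeholders:

```python
import json
import time

import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials -- substitute your own key and region.
speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# Ask the service to include word-level timestamps in the detailed result JSON.
speech_config.request_word_level_timestamps()

audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

# Baseline wall-clock time, captured as close as possible to when the
# recognizer starts consuming audio. Assumption: word Offset values are
# ticks (100 ns each) from the start of the audio stream.
stream_start = None

def on_recognized(evt):
    detailed = json.loads(evt.result.json)
    words = detailed.get("NBest", [{}])[0].get("Words", [])
    for w in words:
        # Offset and Duration are in ticks; 10,000,000 ticks per second.
        word_time = stream_start + w["Offset"] / 10_000_000
        print(f'{w["Word"]!r} spoken at wall-clock {word_time:.3f}')

recognizer.recognized.connect(on_recognized)

stream_start = time.time()          # record the baseline, then start
recognizer.start_continuous_recognition()
time.sleep(30)                      # recognize for 30 seconds
recognizer.stop_continuous_recognition()
```

I timestamp the external events with the same clock (`time.time()`), so if the offsets really are relative to the start of the stream, `stream_start + offset` should line up with them directly. If instead the offsets are relative to the first detected speech, I would expect exactly the kind of drift I am seeing whenever the stream opens with silence.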

Azure AI Speech

1 answer

  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.

