How to collect user voice in real-time from the browser and then send it to Azure Speech-to-Text via WebSocket?

CodeKidz 25 Reputation points
2023-12-15T05:06:24.9066667+00:00

This problem is driving me crazy. The audio stream I capture with MediaRecorder in Chrome only comes out in the WebM format, while the Azure API only accepts WAV and OGG.

And there is no complete example showing how to let users speak in real time in the browser and forward that audio to Azure Speech through my own backend service (to keep the subscription key from leaking).

All the examples just connect to Azure Speech directly from the frontend, which doesn't help here.
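For readers who do want the backend-proxy route described above, one workable pattern is to skip MediaRecorder entirely: capture raw PCM in the browser with an AudioWorklet, stream the Float32 frames over a WebSocket, and have the server convert them to 16-bit PCM for the Speech SDK's push stream. Below is a minimal server-side sketch; the handler shape, sample rate, and helper names are my assumptions, not from an official sample.

```python
# Sketch of a backend WebSocket proxy for Azure Speech-to-Text. Assumes the
# browser captures raw little-endian Float32 PCM at 16 kHz mono (via an
# AudioWorklet -- MediaRecorder only yields WebM) and sends each buffer as a
# binary WebSocket message. Handler shape and names are illustrative.
# Server deps: pip install azure-cognitiveservices-speech websockets
import struct


def float32_to_pcm16(frame: bytes) -> bytes:
    """Convert a little-endian Float32 PCM buffer to 16-bit signed PCM."""
    n = len(frame) // 4
    samples = struct.unpack(f"<{n}f", frame)
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{n}h", *(int(s * 32767) for s in clipped))


async def handle_client(ws, key: str, region: str):
    # Imported here so the conversion helper above stays dependency-free.
    import azure.cognitiveservices.speech as speechsdk

    # 16 kHz / 16-bit / mono is the input format the recognizer expects.
    fmt = speechsdk.audio.AudioStreamFormat(
        samples_per_second=16000, bits_per_sample=16, channels=1)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format=fmt)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speechsdk.SpeechConfig(subscription=key, region=region),
        audio_config=speechsdk.audio.AudioConfig(stream=push_stream))
    # In a real proxy, forward evt.result.text back to the browser over ws.
    recognizer.recognized.connect(lambda evt: print(evt.result.text))
    recognizer.start_continuous_recognition()
    try:
        async for frame in ws:  # one Float32 PCM buffer per binary message
            push_stream.write(float32_to_pcm16(frame))
    finally:
        push_stream.close()
        recognizer.stop_continuous_recognition()
```

The matching browser side would post each AudioWorklet buffer as a binary WebSocket message; resampling to 16 kHz is easiest to handle in the browser (e.g. by constructing the AudioContext with sampleRate: 16000).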

Azure AI Speech
An Azure service that integrates speech processing into apps and services.
1,675 questions

2 answers

  1. CodeKidz 25 Reputation points
    2023-12-15T11:21:23.98+00:00

    Finally I found this project: https://github.com/Azure-Samples/AzureSpeechReactSample

    which shows the correct way to use the Speech SDK in the frontend.

    You don't need to connect through your own server via WebSocket at all: have your backend issue a temporary authorization token, then use the Azure Speech SDK directly in the frontend. The project provides a sample implementation in React.
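For reference, the temporary token this approach relies on comes from Azure's regional STS endpoint. Here is a minimal sketch of the server-side exchange; the helper names are my own, not from the linked sample.

```python
# Sketch of issuing a short-lived Speech token server-side, so the real
# subscription key never reaches the browser. The STS route below is Azure's
# documented token endpoint; token_url/issue_token are illustrative names.
import urllib.request


def token_url(region: str) -> str:
    """Build the regional STS issueToken endpoint URL."""
    return f"https://{region}.api.cognitive.microsoft.com/sts/v1.0/issueToken"


def issue_token(key: str, region: str) -> str:
    """Exchange the subscription key for a token valid for about 10 minutes."""
    req = urllib.request.Request(
        token_url(region),
        method="POST",
        headers={"Ocp-Apim-Subscription-Key": key, "Content-Length": "0"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

The frontend then fetches the token from a route like this and passes it to SpeechConfig.fromAuthorizationToken(token, region) in the JavaScript Speech SDK; since tokens expire after roughly ten minutes, the client should refresh them periodically.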

    1 person found this answer helpful.

  2. Kenneth Díaz González 0 Reputation points
    2024-07-13T07:52:58.36+00:00

    I've recently created sample code for this; feel free to check it out:
    https://stackoverflow.com/a/78743136/26354907

