Real-time speech to text doesn't work when I'm using a Kinesis Video Stream as an input stream.

Bishnoi, Rahul 0 Reputation points
2024-09-18T10:32:01.4566667+00:00

Hi All,

I have a use case where I'm using a KVS stream to record some data and extract the audio from it. After extracting the audio stream, I'm using the Azure Speech service to transcribe the audio into text.

I'm able to extract the audio data and store it in an audio buffer. I'm then pushing this buffer into the Azure Speech service with the help of a push stream.
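
One thing I noticed in the SDK docs: createPushStream() with no arguments assumes the incoming bytes are 16 kHz, 16-bit, mono PCM. Since my source audio is 8 kHz, I'm wondering whether declaring the format explicitly (and skipping my resampling step) would be the right approach. A sketch, assuming the track really is 8 kHz, 16-bit signed little-endian mono:

    // Declare the actual format of the KVS audio instead of relying on
    // the SDK default of 16 kHz, 16-bit, mono PCM
    AudioStreamFormat format = AudioStreamFormat.getWaveFormatPCM(8000L, (short) 16, (short) 1);
    PushAudioInputStream pushStream = AudioInputStream.createPushStream(format);
    AudioConfig audioInput = AudioConfig.fromStreamInput(pushStream);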

The problem I'm facing is that I'm getting partial results from the Speech service: some of the text gets transcribed and sent back to me, whereas at other times it sends back empty results.

The size of the audio buffer is 1024 bytes.
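
(At 8 kHz, 16-bit mono, which is what my resampling step assumes, 1024 bytes works out to 1024 / (8000 × 2) = 64 ms of audio per chunk.)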

I have tried resampling the audio to 16 kHz, since that is the ideal rate for the Speech service, but that does not make any difference.

What can I do to get the transcription in real time? Any help will be appreciated.

I'm posting my code below.

try {
        // 1. Create a PushAudioInputStream to stream audio data to Azure
        PushAudioInputStream pushStream = AudioInputStream.createPushStream();

        // 2. Create an AudioConfig from the PushAudioInputStream
        audioInput = AudioConfig.fromStreamInput(pushStream);
        logger.info("audioInput: {}", audioInput.toString());

        // 3. Create a SpeechConfig with your Azure Speech Service subscription details
        config = SpeechConfig.fromSubscription("xxxxxxxxxxxxxxx", "xxxxxxxxxxxx");
        config.setSpeechRecognitionLanguage("en-IN"); 
        logger.info("config: {}", config.toString());

        // 4. Create a SpeechRecognizer
        SpeechRecognizer recognizer = new SpeechRecognizer(config, audioInput);

        // 5. Add an event listener to handle transcription results
        recognizer.recognized.addEventListener((s, e) -> {
            if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                logger.info("Inside recognized event");
                String transcript = e.getResult().getText();
                // Process the transcript (e.g., save to DynamoDB, log, etc.)
                System.out.println("Azure Transcript: " + transcript);
            }
            else if (e.getResult().getReason() == ResultReason.NoMatch) {
                System.out.println("NOMATCH: Speech could not be recognized.");
            }
        });

        // Add the 'recognizing' event listener
        recognizer.recognizing.addEventListener((s, e) -> {
            if (e.getResult().getReason() == ResultReason.RecognizingSpeech) {
                logger.info("Inside recognizing event");
                String interimTranscript = e.getResult().getText();
                // Process the interim transcript 
                System.out.println("Azure Interim Transcript: " + interimTranscript);
            }
        });

        recognizer.sessionStarted.addEventListener((s, e) -> {
            System.out.println("\n    Session started event.");
        });

        recognizer.canceled.addEventListener((s, e) -> {
            System.out.println("CANCELED: Reason=" + e.getReason());

            if (e.getReason() == CancellationReason.Error) {
                System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
                System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
                System.out.println("CANCELED: Did you set the speech resource key and region values?");
            }
        });

        recognizer.sessionStopped.addEventListener((s, e) -> {
            System.out.println("\n    Session stopped event.");
        });

        // 6. Start continuous recognition
        recognizer.startContinuousRecognitionAsync().get();
        
        // 7. Continuously read and push audio chunks
        while (true) { // Keep reading until the stream ends
            ByteBuffer audioBuffer = KVSUtils.getByteBufferFromStream(
                kvsStreamTrackObject.getStreamingMkvReader(),
                kvsStreamTrackObject.getFragmentVisitor(),
                kvsStreamTrackObject.getTagProcessor(),
                contactId,
                kvsStreamTrackObject.getTrackName()
            );

            if (audioBuffer == null || !audioBuffer.hasRemaining()) {
                // No more data in the current chunk or stream has ended
                logger.info("Audio buffer is empty or null. Potential for transcription gap.");
                //Thread.sleep(100); // Wait for 100 milliseconds before checking again
                break;
            }

            // logger.info("audioBuffer: position: {}, limit: {}, capacity: {}", audioBuffer.position(), audioBuffer.limit(), audioBuffer.capacity());
            // logger.info("Audio buffer size: {} bytes", audioBuffer.remaining());

            // // Log a portion of the audio data in hexadecimal (optional)
            // if (audioBuffer.remaining() > 0) {
            //     int bytesToLog = Math.min(16, audioBuffer.remaining()); // Log up to 16 bytes
            //     byte[] dataSample = new byte[bytesToLog];
            //     audioBuffer.get(dataSample); // Copy a sample of the data
            //     logger.info("Audio data sample: {}", bytesToHex(dataSample));
            // }

            byte[] audioBytes = new byte[audioBuffer.remaining()];
            audioBuffer.get(audioBytes);

            // 8. Resample audio from 8kHz to 16kHz using linear interpolation
            byte[] resampledAudio = resampleLinear(audioBytes, 8000, 16000);
            //    logger.info("Audio buffer pushed to the pushStream.");
            // 9. Push the resampled audio to the Azure Speech Service
            pushStream.write(resampledAudio);
            //Thread.sleep(50);
        }
        // 10. Close the pushStream to signal end of audio
        pushStream.close();

        // 11. Stop continuous recognition
        recognizer.stopContinuousRecognitionAsync().get();
        recognizer.close();

    } catch (Exception e) {
        // Handle exceptions appropriately
        e.printStackTrace();
    }
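
For reference, the resampleLinear helper isn't shown above; it's a simple linear-interpolation upsampler over 16-bit little-endian mono PCM. A minimal sketch of that approach (not my exact code):

    private static byte[] resampleLinear(byte[] input, int inRate, int outRate) {
        int inSamples = input.length / 2; // 16-bit samples
        int outSamples = (int) ((long) inSamples * outRate / inRate);
        byte[] output = new byte[outSamples * 2];
        for (int i = 0; i < outSamples; i++) {
            // Position of this output sample on the input timeline
            double srcPos = (double) i * inRate / outRate;
            int i0 = (int) srcPos;
            int i1 = Math.min(i0 + 1, inSamples - 1);
            double frac = srcPos - i0;
            // Read the two neighbouring 16-bit little-endian samples
            short s0 = (short) ((input[2 * i0] & 0xFF) | (input[2 * i0 + 1] << 8));
            short s1 = (short) ((input[2 * i1] & 0xFF) | (input[2 * i1 + 1] << 8));
            short val = (short) Math.round(s0 + (s1 - s0) * frac);
            // Write the interpolated sample back as 16-bit little-endian
            output[2 * i] = (byte) (val & 0xFF);
            output[2 * i + 1] = (byte) ((val >> 8) & 0xFF);
        }
        return output;
    }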
