Real-time speech-to-text doesn't work when I'm using a Kinesis Video Stream as the input stream.
Hi All,
I have a problem statement where I'm using a KVS stream to record some data and then extracting the audio from that stream. After extracting the audio, I'm using the Azure Speech service to transcribe it into text.
I'm able to extract the audio data and store it in an audio buffer, and I'm pushing this buffer into the Azure Speech service via a push stream.
The problem I'm facing is that I'm only getting partial results from the Speech service: some of the text gets transcribed and sent back to me, while at other times it returns empty results.
The size of the audio buffer is 1024 bytes.
I have tried resampling the audio to a 16 kHz sampling rate, since that is the ideal rate for the Speech service, but it does not make any difference.
What can I do to get the transcription in real time? Any help will be appreciated.
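One alternative I'm aware of, but haven't verified, is declaring the 8 kHz source format on the push stream itself instead of resampling, so the SDK knows what it's receiving. Roughly (untested sketch; my actual code uses the default format, as shown below):

```java
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import com.microsoft.cognitiveservices.speech.audio.AudioInputStream;
import com.microsoft.cognitiveservices.speech.audio.AudioStreamFormat;
import com.microsoft.cognitiveservices.speech.audio.PushAudioInputStream;

// Sketch: tell the SDK the raw KVS audio is 8 kHz, 16-bit, mono PCM,
// so the original bytes can be written without manual resampling.
AudioStreamFormat format = AudioStreamFormat.getWaveFormatPCM(8000L, (short) 16, (short) 1);
PushAudioInputStream pushStream = AudioInputStream.createPushStream(format);
AudioConfig audioInput = AudioConfig.fromStreamInput(pushStream);
```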
I'm posting my code below.
try {
    // 1. Create a PushAudioInputStream to stream audio data to Azure
    PushAudioInputStream pushStream = AudioInputStream.createPushStream();

    // 2. Create an AudioConfig from the PushAudioInputStream
    audioInput = AudioConfig.fromStreamInput(pushStream);
    logger.info("audioInput: {}", audioInput.toString());

    // 3. Create a SpeechConfig with your Azure Speech Service subscription details
    config = SpeechConfig.fromSubscription("xxxxxxxxxxxxxxx", "xxxxxxxxxxxx");
    config.setSpeechRecognitionLanguage("en-IN");
    logger.info("config: {}", config.toString());

    // 4. Create a SpeechRecognizer
    SpeechRecognizer recognizer = new SpeechRecognizer(config, audioInput);

    // 5. Add an event listener to handle final transcription results
    recognizer.recognized.addEventListener((s, e) -> {
        if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
            logger.info("Inside recognized event");
            String transcript = e.getResult().getText();
            // Process the transcript (e.g., save to DynamoDB, log, etc.)
            System.out.println("Azure Transcript: " + transcript);
        } else if (e.getResult().getReason() == ResultReason.NoMatch) {
            System.out.println("NOMATCH: Speech could not be recognized.");
        }
    });

    // Add the 'recognizing' event listener for interim results
    recognizer.recognizing.addEventListener((s, e) -> {
        if (e.getResult().getReason() == ResultReason.RecognizingSpeech) {
            logger.info("Inside recognizing event");
            String interimTranscript = e.getResult().getText();
            // Process the interim transcript
            System.out.println("Azure Interim Transcript: " + interimTranscript);
        }
    });

    recognizer.sessionStarted.addEventListener((s, e) -> {
        System.out.println("\nSession started event.");
    });

    recognizer.canceled.addEventListener((s, e) -> {
        System.out.println("CANCELED: Reason=" + e.getReason());
        if (e.getReason() == CancellationReason.Error) {
            System.out.println("CANCELED: ErrorCode=" + e.getErrorCode());
            System.out.println("CANCELED: ErrorDetails=" + e.getErrorDetails());
            System.out.println("CANCELED: Did you set the speech resource key and region values?");
        }
    });

    recognizer.sessionStopped.addEventListener((s, e) -> {
        System.out.println("\nSession stopped event.");
    });

    // 6. Start continuous recognition
    recognizer.startContinuousRecognitionAsync().get();

    // 7. Continuously read and push audio chunks until the stream ends
    while (true) {
        ByteBuffer audioBuffer = KVSUtils.getByteBufferFromStream(
                kvsStreamTrackObject.getStreamingMkvReader(),
                kvsStreamTrackObject.getFragmentVisitor(),
                kvsStreamTrackObject.getTagProcessor(),
                contactId,
                kvsStreamTrackObject.getTrackName()
        );
        if (audioBuffer == null || !audioBuffer.hasRemaining()) {
            // No more data in the current chunk, or the stream has ended
            logger.info("Audio buffer is empty or null. Potential for transcription gap.");
            // Thread.sleep(100); // Wait for 100 milliseconds before checking again
            break;
        }

        // logger.info("audioBuffer: position: {}, limit: {}, capacity: {}", audioBuffer.position(), audioBuffer.limit(), audioBuffer.capacity());
        // logger.info("Audio buffer size: {} bytes", audioBuffer.remaining());
        // // Log a portion of the audio data in hexadecimal (optional)
        // if (audioBuffer.remaining() > 0) {
        //     int bytesToLog = Math.min(16, audioBuffer.remaining()); // Log up to 16 bytes
        //     byte[] dataSample = new byte[bytesToLog];
        //     audioBuffer.get(dataSample); // Copy a sample of the data
        //     logger.info("Audio data sample: {}", bytesToHex(dataSample));
        // }

        byte[] audioBytes = new byte[audioBuffer.remaining()];
        audioBuffer.get(audioBytes);

        // 8. Resample audio from 8 kHz to 16 kHz using linear interpolation
        byte[] resampledAudio = resampleLinear(audioBytes, 8000, 16000);
        // logger.info("Audio buffer pushed to the pushStream.");

        // 9. Push the resampled audio to the Azure Speech Service
        pushStream.write(resampledAudio);
        // Thread.sleep(50);
    }

    // 10. Close the pushStream to signal end of audio
    pushStream.close();

    // 11. Stop continuous recognition
    recognizer.stopContinuousRecognitionAsync().get();
    recognizer.close();
} catch (Exception e) {
    // Handle exceptions appropriately
    e.printStackTrace();
}
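The `resampleLinear` helper isn't shown above; for completeness, here is a sketch of what such a linear-interpolation upsampler looks like, assuming 16-bit little-endian mono PCM (my actual helper may differ slightly):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Resample {
    // Linear-interpolation resampler for 16-bit little-endian mono PCM.
    public static byte[] resampleLinear(byte[] input, int srcRate, int dstRate) {
        // Reinterpret the raw bytes as 16-bit samples
        short[] in = new short[input.length / 2];
        ByteBuffer.wrap(input).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(in);

        int outLen = (int) ((long) in.length * dstRate / srcRate);
        short[] out = new short[outLen];
        double step = (double) srcRate / dstRate; // source samples per output sample

        for (int i = 0; i < outLen; i++) {
            double pos = i * step;                       // fractional source position
            int i0 = (int) pos;                          // sample before pos
            int i1 = Math.min(i0 + 1, in.length - 1);    // sample after pos (clamped)
            double frac = pos - i0;
            out[i] = (short) Math.round(in[i0] * (1.0 - frac) + in[i1] * frac);
        }

        // Pack the 16-bit samples back into little-endian bytes
        byte[] result = new byte[outLen * 2];
        ByteBuffer.wrap(result).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().put(out);
        return result;
    }
}
```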