Improving Speech to Text Accuracy for Industry-Specific Terminology with Azure AI Service

KT 150 Reputation points
2024-07-09T07:38:58.95+00:00

Hi all,

I want to improve the accuracy of recognizing industry-specific terminology (in Japanese) using the Azure AI Speech service's speech to text feature. The challenge is that these terms can have different meanings in general contexts versus industry-specific contexts. How much training data is required to achieve an accuracy of over 80% in determining the correct context and converting speech to text accurately?

I understand we need to test and provide training data, but we would like to know the minimum amount of data we need to prepare. If it is too much, we may need to consider alternative options or give up.

Azure AI Speech
Azure AI services

Accepted answer
  1. Amira Bedhiafi 19,946 Reputation points
    2024-07-09T09:01:54.1466667+00:00

    I will address your post in five parts:

    How can you improve Speech-to-Text Accuracy for Industry-Specific Terminology?

    Improving the accuracy of speech-to-text systems for industry-specific terminology, particularly in a language like Japanese, involves creating a custom language model with Azure AI Speech's Custom Speech feature. The process starts with gathering industry-specific audio samples and their corresponding transcriptions. It is crucial to ensure these samples represent the variety of terms and contexts you'll encounter. Accurate transcriptions are essential, as errors in the training data can significantly impact the model's performance.
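    As a concrete illustration, Custom Speech accepts audio-plus-transcript training data as audio files bundled with a plain-text transcript file, one tab-separated `filename<TAB>transcription` line per utterance, uploaded as a zip archive. A minimal stdlib-only sketch (the file names and Japanese utterances below are invented placeholders):

    ```python
    import zipfile

    # Hypothetical utterances: audio file name -> verified transcription.
    # In a real dataset, each .wav file would contain the matching speech.
    samples = {
        "utt001.wav": "この端子台の定格電流を確認してください",
        "utt002.wav": "ボールねじのバックラッシュを測定します",
    }

    # One "<audio file>\t<transcription>" line per utterance.
    with open("trans.txt", "w", encoding="utf-8") as f:
        for name, text in samples.items():
            f.write(f"{name}\t{text}\n")

    # Transcript (and, in practice, the audio files) go into one zip
    # archive for upload to the Custom Speech portal or API.
    with zipfile.ZipFile("training_data.zip", "w") as z:
        z.write("trans.txt")
    ```

    Consistent, human-verified transcriptions in this file matter more than raw volume; a single systematic transcription error repeated across many lines will be learned by the model.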

    How much training data is needed?

    The amount of training data required can vary, but here are some guidelines. Start with a baseline model to understand its performance on your specific terminology. Gradually add data and monitor improvements. For complex terminology and context differentiation, several hours of high-quality, annotated audio may be necessary. Aiming for 50-100 hours of diverse audio data can be a reasonable starting point. Continuously evaluate the model using a separate validation set to ensure improvements.

    How can you enhance Recognition of Specific Terms?

    To enhance the recognition of specific terms, use the Custom Speech feature to add industry-specific terms to the model's vocabulary. This helps the model recognize and correctly transcribe these terms. Additionally, utilize the Phrase List feature to boost the recognition of specific phrases and terms in particular contexts. Iterative training and testing are crucial; train the model iteratively, testing after each iteration to identify improvements and areas needing more data or better quality data.
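    Note that Phrase List is a runtime hint passed to the recognizer (via `PhraseListGrammar` in the Speech SDK), not something you train: it biases recognition toward the listed terms. A stdlib-only sketch of that biasing idea, rescoring hypothetical n-best hypotheses (the terms, scores, and boost value are all invented for illustration; the SDK handles this internally):

    ```python
    # Hypothetical n-best recognizer output: (transcript, score) pairs.
    nbest = [
        ("番傘を点検する", 0.62),    # generic-context reading
        ("バンサを点検する", 0.58),  # industry term, scored lower
    ]

    phrase_list = ["バンサ"]  # industry-specific terms to boost
    BOOST = 0.1               # arbitrary illustrative bonus

    def rescore(hypotheses, phrases, boost=BOOST):
        """Prefer hypotheses containing a boosted phrase."""
        return max(hypotheses,
                   key=lambda th: th[1] + boost * sum(p in th[0] for p in phrases))

    # With the boost, the industry-term hypothesis wins.
    print(rescore(nbest, phrase_list)[0])
    ```

    Because phrase lists take effect immediately and require no training data, they are a cheap first step before investing in a full custom model.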

    What is the minimum data requirement for a start?

    While the exact amount of data can vary, here’s a rough estimate to start with. Begin with 10-20 hours of high-quality annotated audio to understand baseline accuracy. Gradually increase this to 50-100 hours, monitoring improvements and making adjustments as necessary. This iterative process helps in gradually enhancing the model’s performance.
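    When growing a dataset from ~10-20 hours toward 50-100, it helps to measure how much audio you actually have. A stdlib sketch that totals WAV durations (it first writes a one-second placeholder file so the example is self-contained; in practice you would scan your real recordings):

    ```python
    import struct
    import wave

    # Create a tiny placeholder WAV (1 second of silence, 16 kHz mono)
    # purely so this sketch runs on its own.
    with wave.open("sample.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(struct.pack("<16000h", *([0] * 16000)))

    def total_hours(paths):
        """Sum WAV durations (frames / frame rate) across a dataset."""
        seconds = 0.0
        for p in paths:
            with wave.open(p, "rb") as w:
                seconds += w.getnframes() / w.getframerate()
        return seconds / 3600

    print(f"{total_hours(['sample.wav']):.6f} hours")
    ```

    Tracking hours collected against validation error after each training round shows whether more data is still paying off.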

    What If Collecting Large Amounts of Data Is Challenging?

    If collecting large amounts of data is challenging, consider the following alternatives. Use pre-trained models available on Azure and fine-tune them with your data. Evaluate third-party speech-to-text services that might have better baseline performance for your specific needs. Combine automated transcription with human correction to balance cost and accuracy. These alternatives can help achieve the desired accuracy with potentially less effort in data collection.
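    The automated-transcription-plus-human-correction option can be as simple as routing low-confidence results to a reviewer; Azure's recognition results include per-result confidence scores you can threshold. A stdlib sketch with invented results and an arbitrary cut-off:

    ```python
    # Hypothetical recognition results: (transcript, confidence).
    results = [
        ("本日の会議を開始します", 0.94),
        ("サーボアンプのゲインを調整", 0.55),  # likely an industry term
        ("品質管理の報告です", 0.91),
    ]

    THRESHOLD = 0.80  # arbitrary cut-off; tune against your own data

    def route(results, threshold=THRESHOLD):
        """Split transcripts into auto-accepted and human-review queues."""
        accepted = [t for t, c in results if c >= threshold]
        review = [t for t, c in results if c < threshold]
        return accepted, review

    accepted, review = route(results)
    print(f"auto-accepted: {len(accepted)}, needs review: {len(review)}")
    ```

    The corrected transcripts from the review queue can then be fed back as training data, so the human effort also shrinks the model's future error rate.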

