Will Azure AI Speech generate styles such as "happy", "cheerful", "excited" automatically from the data given?

Question

I've added data with about 750 utterances. 80% are normal sentences, while 10% are questions and the other 10% are exclamations.

What does Speech Studio need in order to generate styles such as Happy, Cheerful, etc.?
Do I have to give it more data? Or is a clean set of 750 utterances enough for it to generate such styles?

Is there additional pricing involved for styles? Kindly help.

Accepted Answer

Microsoft's Custom Neural Voice in Speech Studio uses machine learning to generate a unique voice. The quality of the voice and its ability to express different styles like "Happy", "Cheerful", etc., largely depends on the quality and diversity of the training data.

Data Quantity: While 750 utterances can be a good start, more data usually leads to better results. Microsoft recommends a minimum of 300-500 sentences for a draft voice, and 2000 sentences for a more natural-sounding voice.

Data Diversity: The utterances should cover a wide range of phonetic and prosodic variations. If you want the voice to express different styles, the training data should include examples of these styles.
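As a quick way to check the mix in a dataset like the one described in the question, here is a minimal Python sketch that tallies utterances by sentence type. The `classify` and `composition` helpers and the toy sample are hypothetical illustrations, not part of Speech Studio or any Azure SDK. The sketch assumes the transcripts are available as a plain list of strings and that the final punctuation mark is a good enough signal of sentence type.

```python
from collections import Counter


def classify(utterance: str) -> str:
    """Label an utterance by its final punctuation mark (a rough heuristic)."""
    text = utterance.strip()
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"


def composition(utterances):
    """Return {label: (count, share_of_total)} for a list of transcript lines."""
    counts = Counter(classify(u) for u in utterances)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Toy data mirroring the 750-utterance, 80/10/10 split from the question.
sample = (
    ["The train leaves at nine."] * 600
    + ["Is it raining?"] * 75
    + ["Watch out!"] * 75
)

for label, (n, share) in composition(sample).items():
    print(f"{label}: {n} ({share:.0%})")
```

Note that this only measures sentence type. A question or an exclamation is not the same thing as a "cheerful" or "excited" speaking style, so a tally like this cannot tell you whether the recordings actually contain examples of the styles you want.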

Data Quality: The recordings should be clean, high-quality, and free of background noise. The speaker should have consistent pronunciation, volume, speed, and pitch.

As for pricing, synthesis with Custom Neural Voice is billed by the amount of speech synthesized, not by style. Model training and endpoint hosting are billed separately from synthesis, so check the Azure AI Speech pricing page for current rates. Creating a high-quality custom voice that can express different styles might also require more training data, which could increase the cost of data collection and preparation.

Remember to follow Microsoft's responsible AI guidelines when using Custom Neural Voice. You must have the necessary permissions from the speaker, and the use of the custom voice must comply with Microsoft's use case policy.

I hope this helps! If you have any further questions, feel free to ask.

If the information is useful, please accept the answer and upvote it to assist other community members.
