Microsoft's Custom Neural Voice in Speech Studio trains a neural text-to-speech model on recordings of a specific speaker to produce a unique voice. The quality of that voice, and its ability to express styles such as "Happy" or "Cheerful", depends largely on the quality and diversity of the training data.
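For context, this is roughly how a style is requested at synthesis time once the voice supports it. It is a minimal sketch, assuming the voice is called through SSML with the `mstts:express-as` element. The voice name `MyCustomVoiceNeural` and the helper `build_styled_ssml` are hypothetical, and the style names must match whatever styles your model was actually trained with.

```python
from xml.sax.saxutils import escape, quoteattr


def build_styled_ssml(text, voice_name, style, lang="en-US"):
    """Build an SSML document asking a voice to speak `text` in `style`.

    `voice_name` and `style` are placeholders: use the name of your own
    deployed voice and a style that voice was trained to support.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>"
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the SSML stays well-formed
        "</mstts:express-as></voice></speak>"
    )


if __name__ == "__main__":
    # Hypothetical voice name, for illustration only.
    print(build_styled_ssml("Thanks for calling!", "MyCustomVoiceNeural", "cheerful"))
```

If the voice was not trained with data for the requested style, the request is unlikely to sound different from the default voice, which is why the training data matters.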
Data Quantity: 750 utterances is a good start, but more data usually leads to better results. Microsoft recommends a minimum of 300-500 sentences for a draft voice, and around 2,000 sentences for a more natural-sounding voice.
Data Diversity: The utterances should cover a wide range of phonetic and prosodic variations. If you want the voice to express different styles, the training data should include recordings spoken in each of those styles.
Data Quality: The recordings should be clean, high-quality, and free of background noise. The speaker should have consistent pronunciation, volume, speed, and pitch.
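To see where a dataset stands against the points above, a quick tally can help. The sketch below assumes a hypothetical layout in which each utterance is a `(text, style)` pair. The thresholds are simply the figures quoted in this answer, not values read from any Microsoft API, so check them against the current documentation.

```python
from collections import Counter

# Figures quoted in this answer; treat them as assumptions to verify.
DRAFT_MIN = 300
NATURAL_TARGET = 2000


def summarize_dataset(utterances, wanted_styles):
    """Count utterances overall and per style.

    `utterances` is an iterable of (text, style) pairs (hypothetical layout).
    `wanted_styles` lists the styles the finished voice should express.
    """
    utterances = list(utterances)
    total = len(utterances)
    per_style = Counter(style for _, style in utterances)

    if total >= NATURAL_TARGET:
        tier = "natural"
    elif total >= DRAFT_MIN:
        tier = "draft"
    else:
        tier = "insufficient"

    # Styles that were requested but have no recordings at all.
    missing = sorted(s for s in wanted_styles if per_style[s] == 0)
    return {
        "total": total,
        "tier": tier,
        "per_style": dict(per_style),
        "missing_styles": missing,
    }
```

For example, 750 utterances with none recorded in a "Happy" style would be reported as tier `"draft"` with `"Happy"` listed under `missing_styles`.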
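As for pricing, there is no separate charge per style: synthesis with Custom Neural Voice is billed by the amount of speech synthesized. Model training and endpoint hosting are billed as separate items, so check the current Azure AI Speech pricing page for the rates that apply to you. However, creating a high-quality custom voice that can express different styles might require more training data, which could increase the cost of data collection and preparation.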
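A rough estimate of the synthesis charge might look like the sketch below. It assumes synthesis is metered per character, and `price_per_million_chars` is a placeholder you fill in from the pricing page; no real rate is hard-coded here. Note that the style does not appear anywhere in the calculation.

```python
def estimate_synthesis_cost(texts, price_per_million_chars):
    """Estimate the synthesis charge for a batch of texts.

    Assumes per-character metering. `price_per_million_chars` is a
    placeholder rate taken from the current pricing page.
    Training and endpoint hosting charges are not included.
    """
    total_chars = sum(len(t) for t in texts)
    return total_chars * price_per_million_chars / 1_000_000
```

The same text costs the same to synthesize in any style under this assumption, so the style-related cost sits mainly in recording and preparing the extra training data.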
Remember to follow Microsoft's responsible AI guidelines when using Custom Neural Voice. You must have the necessary permissions from the speaker, and the use of the custom voice must comply with Microsoft's use case policy.
I hope this helps! If you have any further questions, feel free to ask.