Is there a way to control the desired duration of a speech using Text to Speech from Azure Cognitive Services?

Victor Kzam 1 Reputation point
2020-11-08T11:04:27.477+00:00

Hello there. I am building an application that recognizes speech, transcribes it to text, translates it, and then regenerates the speech in a different language. A human supervises the process to increase the accuracy rate.

However, I am finding it very difficult to control the audio output, especially the duration of each sentence or paragraph. Following the documentation, the application generates speech in other languages using <prosody duration="XXXXms"> for each sentence.

However, the output does not come out as desired. For example, I have a file containing 22 paragraphs that should take 03min43s to speak. Each paragraph is arranged like the following example: <p><prosody duration="4000ms">O objeto.</prosody></p>. The application seems to ignore the duration and takes the natural time to generate the audio, which results in a 02min22s file.
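To quantify the gap described above, here is a minimal sketch that sums the durations requested in the SSML and compares them with the real length of the generated audio. It assumes the output is saved as an uncompressed PCM WAV file and that every duration is written in milliseconds, as in the example above. The helper names `requested_ms`, `wav_ms` and `shortfall_ms` are hypothetical and are not part of any Azure SDK.

```python
import re
import wave

# Matches duration="NNNNms" on a <prosody> tag. Assumption: values are always
# given in milliseconds, as in the example above; other units are ignored.
DURATION_RE = re.compile(r'<prosody\b[^>]*\bduration="(\d+)ms"')


def requested_ms(ssml: str) -> int:
    """Sum every <prosody duration="NNNNms"> value found in the SSML text."""
    return sum(int(ms) for ms in DURATION_RE.findall(ssml))


def wav_ms(source) -> int:
    """Playback length in ms of a PCM WAV file (a path or a binary file object)."""
    with wave.open(source, "rb") as wav:
        return round(wav.getnframes() * 1000 / wav.getframerate())


def shortfall_ms(ssml: str, source) -> int:
    """Requested total minus actual audio length; positive means too short."""
    return requested_ms(ssml) - wav_ms(source)
```

A result far above zero, such as the roughly 81 seconds between 03min43s and 02min22s, suggests that the duration attribute is not being applied at all, rather than being applied imprecisely.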

Any ideas why this might be happening? And what should I do to avoid this outcome?

Thank you very much for any help that you guys may provide.

Azure AI services

1 answer

  1. GiftA-MSFT 11,166 Reputation points
    2020-11-17T09:13:21.61+00:00

    The duration attribute is supported only with standard voices. Please ensure you are using a standard voice for your scenario. Thanks.
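For anyone trying this, a minimal sketch of building the full SSML document with an explicit voice might look like the following. It assumes the paragraphs are available as (text, duration in ms) pairs and that the voice is selected with a `<voice name="...">` element. `build_ssml` is a hypothetical helper, and `STANDARD_VOICE_NAME` is a placeholder: replace it with a standard voice that your Speech resource actually offers.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(paragraphs, voice_name, lang="pt-BR"):
    """Build an SSML document with one <p><prosody duration=...> per paragraph.

    paragraphs: iterable of (text, duration_ms) pairs.
    voice_name: placeholder supplied by the caller; per the answer above,
                it should name a standard voice.
    """
    body = "".join(
        # escape() protects &, < and > that may appear in the translated text.
        f'<p><prosody duration="{int(ms)}ms">{escape(text)}</prosody></p>'
        for text, ms in paragraphs
    )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2005/01/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>{body}</voice>"
        "</speak>"
    )


# Example with the paragraph from the question; the voice name is a placeholder.
ssml = build_ssml([("O objeto.", 4000)], "STANDARD_VOICE_NAME")
```

Keeping the voice name as a parameter makes it easy to synthesize the same text with different voices and compare how each one treats the duration attribute.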
