Viseme Event time offsets in Custom Neural Voice are weird.

현우 오 181 Reputation points
2021-10-12T07:17:56.333+00:00

The viseme event time offsets I get from a Custom Neural Voice look strange (locale: ko-KR).
The lists below show the viseme events emitted when the same text is synthesized with two different voice names (InJoonNeural and OHW_Neural).

Below is InJoonNeural, one of the existing prebuilt neural voices.

InJoonNeural

  1. Viseme : 0, Time : 50ms
  2. Viseme : 20, Time : 50ms
  3. Viseme : 4, Time : 350ms
  4. Viseme : 21, Time : 450ms
  5. Viseme : 2, Time : 525ms
  6. Viseme : 19, Time : 625ms
  7. Viseme : 12, Time : 650ms
  8. Viseme : 4, Time : 650ms
  9. Viseme : 6, Time : 775ms
  10. Viseme : 8, Time : 806ms
  11. Viseme : 0, Time : 50ms

Below is the Custom Neural Voice, OHW_Neural.

OHW_Neural

  1. Viseme : 0, Time : 50ms
  2. Viseme : 20, Time : 100ms
  3. Viseme : 4, Time : 225ms
  4. Viseme : 21, Time : 325ms
  5. Viseme : 2, Time : 375ms
  6. Viseme : 19, Time : 437ms
  7. Viseme : 4, Time : 650ms
  8. Viseme : 6, Time : 700ms
  9. Viseme : 8, Time : 893ms
  10. Viseme : 0, Time : 1087ms

In InJoonNeural, the gap between No. 6 and No. 8 is 25 ms, whereas the gap between the corresponding events in OHW_Neural (No. 6 and No. 7) is 113 ms, which is a large difference.
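
For anyone who wants to reproduce this comparison, here is a minimal Python sketch that computes the two gaps from the offsets exactly as they are listed in this post. The helper name `gap_ms` and the tuple layout are made up for illustration and are not part of the Speech SDK. Note that with the listed offsets the OHW_Neural gap works out to 213 ms rather than 113 ms, so one of the figures in the post may contain a typo.

```python
# Viseme events as (viseme_id, offset_ms), copied from the two lists above.
injoon = [(0, 50), (20, 50), (4, 350), (21, 450), (2, 525), (19, 625),
          (12, 650), (4, 650), (6, 775), (8, 806), (0, 50)]
ohw = [(0, 50), (20, 100), (4, 225), (21, 325), (2, 375), (19, 437),
       (4, 650), (6, 700), (8, 893), (0, 1087)]


def gap_ms(events, start_no, end_no):
    """Offset difference in ms between two entries, using the 1-based list numbers."""
    return events[end_no - 1][1] - events[start_no - 1][1]


# Gap from viseme 19 to the following viseme 4 in each voice.
injoon_gap = gap_ms(injoon, 6, 8)
ohw_gap = gap_ms(ohw, 6, 7)

print(f"InJoonNeural: {injoon_gap} ms")  # InJoonNeural: 25 ms
print(f"OHW_Neural:   {ohw_gap} ms")     # OHW_Neural:   213 ms
```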

When I compare these offsets with the synthesized WAV file itself, I find that viseme events are still being emitted (from No. 7 of OHW_Neural onward) at a point where the speech is almost finished and no viseme event should appear.

Is there any way to fix the problems described above?
I also wonder whether they would improve if the pronunciation in the training data used for the Custom Neural Voice were more accurate.

Azure AI Speech

Accepted answer
  1. GiftA-MSFT 11,166 Reputation points
    2021-10-26T18:04:24.097+00:00

    Hi, following up. Different voices have different speaking rates, so viseme times can't be compared directly between voices. As for viseme events being emitted when the speech is almost finished (from No. 7 of OHW_Neural), please redeploy your endpoint. Redeploying makes the endpoint use the latest code.


    --- *Kindly Accept Answer if the information helps. Thanks.*
