The Viseme event time offsets I get from a Custom Neural Voice look wrong. (Locale in use: ko-KR)
The following lists show the Viseme events produced when the same text is synthesized with two different voice names (InJoonNeural, OHW_Neural).
Below are the events from InJoonNeural, one of the existing prebuilt Neural Voices.
InJoonNeural
- Viseme : 0, Time : 50ms
- Viseme : 20, Time : 50ms
- Viseme : 4, Time : 350ms
- Viseme : 21, Time : 450ms
- Viseme : 2, Time : 525ms
- Viseme : 19, Time : 625ms
- Viseme : 12, Time : 650ms
- Viseme : 4, Time : 650ms
- Viseme : 6, Time : 775ms
- Viseme : 8, Time : 806ms
- Viseme : 0, Time : 50ms
Below are the events from the Custom Neural Voice, OHW_Neural.
OHW_Neural
- Viseme : 0, Time : 50ms
- Viseme : 20, Time : 100ms
- Viseme : 4, Time : 225ms
- Viseme : 21, Time : 325ms
- Viseme : 2, Time : 375ms
- Viseme : 19, Time : 437ms
- Viseme : 4, Time : 650ms
- Viseme : 6, Time : 700ms
- Viseme : 8, Time : 893ms
- Viseme : 0, Time : 1087ms
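In InJoonNeural, the time between No. 6 and No. 8 (Viseme 19 to Viseme 4) is 25 ms, while in OHW_Neural the time between the corresponding events, No. 6 and No. 7, is 213 ms (650 ms - 437 ms), which is a large difference. Viseme 12, which appears as No. 7 in InJoonNeural, is also missing from the OHW_Neural output.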
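For reference, here is a minimal Python sketch of how I compute these gaps from the offsets listed above. The names `INJOON`, `OHW`, and `gap_ms` are only for illustration and are not part of any SDK; the data is copied by hand from the two lists. With these values the OHW_Neural gap works out to 650 ms - 437 ms = 213 ms.

```python
# Viseme events as (viseme_id, offset_ms), copied by hand from the lists above.
INJOON = [(0, 50), (20, 50), (4, 350), (21, 450), (2, 525), (19, 625),
          (12, 650), (4, 650), (6, 775), (8, 806), (0, 50)]
OHW = [(0, 50), (20, 100), (4, 225), (21, 325), (2, 375), (19, 437),
       (4, 650), (6, 700), (8, 893), (0, 1087)]


def gap_ms(events, start_no, end_no):
    """Return the offset difference (ms) between two 1-based event numbers."""
    return events[end_no - 1][1] - events[start_no - 1][1]


print(gap_ms(INJOON, 6, 8))  # Viseme 19 -> Viseme 4 in InJoonNeural: 25
print(gap_ms(OHW, 6, 7))     # Viseme 19 -> Viseme 4 in OHW_Neural: 213
```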
When I compare these offsets with the wav file synthesized directly from the same text, Viseme events are still being emitted at points where the speech is almost finished and no Viseme event should appear. (This applies from No. 7 of OHW_Neural onward.)
Is there any way to fix the problems described above?
I also wonder whether the offsets would improve if the pronunciation in the training data used for the Custom Neural Voice were more accurate.