How to fix an issue where my 3D Blendshapes do not align with the audio.

Ananchai Mankhong 0 Reputation points
2024-07-08T07:21:23.1533333+00:00

I'm trying to apply viseme 3D blend shapes to drive my 3D avatar.

When the result is returned, the audio starts playing before the frames from the response (FrameIndex and BlendShapes) are applied.

I receive event.animation and use it to set the weight for each blend shape name. However, for the first events it can't be parsed as JSON, and I'm not sure why. It looks like event.animation isn't returned for those events; only event.audioOffset and event.visemeID are set, and event.animation is empty. Below you can see the issue.

Error parsing JSON: Error Domain=NSCocoaErrorDomain Code=3840 "Unable to parse empty data." UserInfo={NSDebugDescription=Unable to parse empty data.}
Error parsing JSON: Error Domain=NSCocoaErrorDomain Code=3840 "Unable to parse empty data." UserInfo={NSDebugDescription=Unable to parse empty data.}
..
..
..
Error parsing JSON: Error Domain=NSCocoaErrorDomain Code=3840 "Unable to parse empty data." UserInfo={NSDebugDescription=Unable to parse empty data.}

And then I got:

index = 0 blendshapeName: eyeBlinkLeft value: 0.171
Morpher weight = 0.171 for blendshapeName: eyeBlinkLeft
index = 1 blendshapeName: eyeLookDownLeft value: 0.164
Morpher weight = 0.164 for blendshapeName: eyeLookDownLeft
..
..
..
index = 54 blendshapeName: rightEyeRoll value: 0.0
Morpher weight = 0.0 for blendshapeName: rightEyeRoll

After that, my 3D avatar's blend shapes animate, but they don't align with the audio.
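As for the JSON errors at the start, I assume I could just skip events whose payload is empty before parsing. This is only a sketch of what I mean, assuming event.animation comes back as an empty string (not nil) when an event carries no frames:

Swift

// Skip viseme events that carry no animation payload (only audioOffset / visemeID),
// so the parser is never handed an empty string.
synthesizer.addVisemeReceivedEventHandler { (synthesizer, event) in
    let animation = event.animation
    guard !animation.isEmpty else { return }
    self.mapBlendshapesToModel(jsonString: animation, node: self.contentNode)
}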

I read this description in your documentation:

“ Each viseme event includes a series of frames in the Animation SDK property. These frames are grouped to best align the facial positions with the audio. Your 3D engine should render each group of BlendShapes frames immediately before the corresponding audio chunk. The FrameIndex value indicates how many frames preceded the current list of frames.

The output json looks like the following sample. Each frame within BlendShapes contains an array of 55 facial positions represented as decimal values between 0 to 1.

JSON

{
    "FrameIndex":0,
    "BlendShapes":[
        [0.021,0.321,...,0.258],
        [0.045,0.234,...,0.288],
        ...
    ]
}

The decimal values in the json response are in the same order as described in the following facial positions table. The order of BlendShapes is as follows. “
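Based on that description, this is how I read each payload into a typed value. It's just a sketch using Codable; the struct and function names are my own:

Swift

import Foundation

// My own container for the documented payload: "FrameIndex" plus "BlendShapes",
// an array of frames, each holding 55 values between 0 and 1.
struct VisemeAnimationChunk: Decodable {
    let frameIndex: Int
    let blendShapes: [[Double]]

    enum CodingKeys: String, CodingKey {
        case frameIndex = "FrameIndex"
        case blendShapes = "BlendShapes"
    }
}

func decodeAnimationChunk(_ jsonString: String) -> VisemeAnimationChunk? {
    guard let data = jsonString.data(using: .utf8), !data.isEmpty else { return nil }
    return try? JSONDecoder().decode(VisemeAnimationChunk.self, from: data)
}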

I thought I could simply parse the JSON from event.animation right inside the addVisemeReceivedEventHandler closure and apply it immediately.

This is part of my code. Could you help me improve it or fix this issue?

Thank you very much.

 // Part of my class; sub, region, and contentNode are stored properties on it.
 // Requires import MicrosoftCognitiveServicesSpeech and import SceneKit.
 func synthesisToSpeaker() {
        guard let subscriptionKey = sub, let region = region else {
            print("Speech key and region are not set.")
            return
        }
        
        var speechConfig: SPXSpeechConfiguration?
        do {
            try speechConfig = SPXSpeechConfiguration(subscription: subscriptionKey, region: region)
        } catch {
            print("Error creating speech configuration: \(error)")
            return
        }
        
        speechConfig?.speechSynthesisVoiceName = "en-US-AvaMultilingualNeural"
        speechConfig?.setSpeechSynthesisOutputFormat(.raw16Khz16BitMonoPcm)
        
        guard let synthesizer = try? SPXSpeechSynthesizer(speechConfig!) else {
            print("Error creating speech synthesizer.")
            return
        }
        
        let ssml = """
            <speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis'
                   xmlns:mstts='http://www.w3.org/2001/mstts'>
                <voice name='en-US-CoraNeural'>
                    <mstts:viseme type='FacialExpression'/>
                    Hello World, May I help you?
                </voice>
            </speak>
            """
        
        // Subscribe to viseme received event
        synthesizer.addVisemeReceivedEventHandler { (synthesizer, event) in
            self.mapBlendshapesToModel(jsonString: event.animation,
                                       node: self.contentNode)
           //print("\(event.animation)")
        }
        
        do {
            let result = try synthesizer.speakSsml(ssml)
            
            switch result.reason {
            case .recognizingSpeech:
                print("Synthesis recognizingSpeech")
            case .recognizedSpeech:
                print("Synthesis recognizedSpeech")
            case .synthesizingAudioCompleted:
                print("Synthesis synthesizingAudioCompleted")
            default:
                print("Synthesis failed: \(result.description)")
            }
        } catch {
            debugPrint("speakSsml failed")
        }
    }

func mapBlendshapesToModel(jsonString: String, node: SCNNode?) {
        guard let jsonData = jsonString.data(using: .utf8) else {
            print("Invalid JSON Data")
            return
        }
        
        guard let node = node else {
            print("Node is nil")
            return
        }
        
        do {
            let json = try JSONSerialization.jsonObject(with: jsonData, options: [])
            if let dictionary = json as? [String: Any] {
                if let frameIndex = dictionary["FrameIndex"] as? Int,
                   let blendShapes = dictionary["BlendShapes"] as? [[Double]] {
                    //setup my 3d
                }
            }
        } catch {
            print("Error parsing JSON: \(error)")
        }
    }
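One idea I'm considering, instead of applying the weights immediately inside the callback, is to buffer each chunk together with its audio offset and apply the frames from a timer driven by the playback clock. This is only a rough sketch of that idea, reusing the VisemeAnimationChunk struct from the sketch above; the 60 fps frame duration and the 100-ns tick conversion for audioOffset are my assumptions:

Swift

import SceneKit

final class BlendShapeFrameQueue {
    struct TimedFrame {
        let time: TimeInterval    // seconds from the start of the utterance
        let weights: [Double]     // 55 blend shape values, in the documented order
    }

    private var frames: [TimedFrame] = []
    private let frameDuration = 1.0 / 60.0    // assumption: animation frames are ~60 fps

    // Called from the viseme handler; audioOffset is assumed to be in 100-ns ticks.
    func enqueue(_ chunk: VisemeAnimationChunk, audioOffsetTicks: UInt64) {
        let chunkStart = Double(audioOffsetTicks) / 10_000_000.0
        for (i, weights) in chunk.blendShapes.enumerated() {
            frames.append(TimedFrame(time: chunkStart + Double(i) * frameDuration,
                                     weights: weights))
        }
    }

    // Called from a CADisplayLink or timer synced to audio playback, on the main thread.
    func applyFrames(upTo playbackTime: TimeInterval, to node: SCNNode) {
        while let frame = frames.first, frame.time <= playbackTime {
            frames.removeFirst()
            for (index, value) in frame.weights.enumerated() {
                node.morpher?.setWeight(CGFloat(value), forTargetAt: index)
            }
        }
    }
}

Does that match what you describe with "render each group of BlendShapes frames immediately before the corresponding audio chunk", or am I misreading it?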
Azure AI Speech
An Azure service that integrates speech processing into apps and services.

1 answer

  1. Deleted

    This answer has been deleted due to a violation of our Code of Conduct. The answer was manually reported or identified through automated detection before action was taken. Please refer to our Code of Conduct for more information.


Comments have been turned off.
