I need to create a real-time 2D conversational avatar using custom portrait image

Question

I want to create a webpage with a conversational 2D AI avatar using cognitive services with custom portrait image. Is it possible with Azure ?

Answer

Hello @SpandanB

Thanks for reaching out to us, I think you want to do a AI avatar which will act the conversation? Is this correct?

If yes, you can connect the two APIs to accomplish the target - Azure OpenAI and Azure Speech.

For conversational, the first part will be a chat bot, which you can leverage Language Service or OpenAI Service.

https://video2.skills-academy.com/en-us/azure/ai-services/openai/chatgpt-quickstart?tabs=command-line%2Cpython-new&pivots=programming-language-studio

For Avatar, you can consider Text to Speech feature.

If yes, please check on below document about how to make a "conversational" avatar, it can be 3D or 2D.- https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/azure-ai-speech-announces-public-preview-of-text-to-speech/ba-p/3981448

What is text to speech avatar?

The text to speech avatar system is a text to speech feature with vision capabilities, that allow customers to create synthetic videos of a 2D photorealistic avatar speaking. The Neural text to speech Avatar models are trained by deep neural networks based on the human video recording samples, and the voice of the avatar is provided by text to speech voice model.

Why do we build avatars? There are two main reasons:

Traditional video content creation requires a lot of time and budget, including setting up video shooting environment, filming videos, editing, etc. With text to speech avatar, users can more efficiently create video. Users can use the avatar to build training videos, product introductions, customer testimonials, etc., simply with text input.
With the release of Azure OpenAI Service and neural text to speech, interactive conversation is more natural than before. With text to speech avatar, the users can create more engaging digital interactions. You can use the avatar to build conversational agents, virtual assistants, chatbots, and more.

There are three components in an avatar content generation workflow: text analyzer, the TTS audio synthesizer, and TTS avatar video synthesizer. To generate avatar video, text is first input into the text analyzer, which provides the output in the form of phoneme sequence. Then, the TTS audio synthesizer predicts the acoustic features of the input text and synthesize the voice. These two parts are provided by text to speech voice models. Next, the Neural text to speech Avatar model predicts the image of lip sync with the acoustic features, so that the synthetic video is generated.

Below is an overview of the workflow:

thumbnail image 1 of blog post titled Azure AI Speech announces public preview of text to speech avatar

Please take a look and have a try, I hope this helps.

Regards,

Yutong

-Please kindly accept the answer if you feel helpful to support the community, thanks a lot.

Share via

I need to create a real-time 2D conversational avatar using custom portrait image

1 answer