Alibaba has recently introduced a new AI model capable of turning a simple portrait photo and an audio clip into a film-quality digital human that can speak, sing, and even perform.

The model is called ‘Wan2.2-S2V’ (Speech-to-Video) and, as one of the solutions within the Wan2.2 video generation series, it can render its subject from a portrait, bust, or even full-body perspective while factoring environmental elements into the scene based on text prompts. That means creators aren’t just stuck with static talking faces; they can build entire scenes with multiple characters moving naturally.

Under the hood, it blends text-guided global motion control with audio-driven local movements for smooth, expressive animations, and through targeted optimization the model is capable of generating longer videos without losing stability.

Resolution-wise, it supports 480p and 720p output, making it flexible enough for anything from TikTok-style social media content to more polished professional presentations. And since it can handle all sorts of avatars, from realistic humans to animals and stylized characters, the use cases are wide open.

To train the system, Alibaba built a massive audio-visual dataset inspired by film and TV production. That’s why it works so well across different formats, from vertical short clips to cinematic widescreen videos.

If you want to try it out, Wan2.2-S2V is already up on Hugging Face, GitHub, and Alibaba’s ModelScope community.
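
For anyone who prefers scripting the setup, here's a minimal sketch that pulls the model weights down from Hugging Face using the huggingface_hub library. The repo id below is an assumption based on the release naming, so verify it against the official model card before running:

```python
# Minimal sketch: download the Wan2.2-S2V checkpoint from Hugging Face.
# The repo id "Wan-AI/Wan2.2-S2V-14B" is an assumption -- check the
# official model card for the exact name.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",  # assumed repo id
    local_dir="./Wan2.2-S2V-14B",     # where the weights land locally
)
print(f"Weights saved to {local_dir}")
```

From there, inference is run through the scripts in the Wan2.2 GitHub repository, which, as the article describes, take a reference image, an audio clip, and an optional text prompt as inputs.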
