Alibaba has recently introduced a new AI model capable of turning a simple portrait photo and an audio clip into a film-quality digital human that can speak, sing, and even perform.

The model is called ‘Wan2.2-S2V’ (Speech-to-Video) and, as one of the solutions within the Wan2.2 video generation series, it can render its subject from a portrait, bust, or even full-body perspective while factoring environmental elements into the scene based on text prompts. That means creators aren’t just stuck with static talking faces; they can build entire scenes with multiple characters moving naturally.

Under the hood, it blends text-guided global motion control with audio-driven local movements for smooth, expressive animations, and through targeted optimization the model is capable of generating longer videos without losing stability.

Resolution-wise, it supports 480p and 720p output, making it flexible enough for anything from TikTok-style social media content to more polished professional presentations. And since it can handle all sorts of avatars, from realistic humans to animals and stylized characters, the use cases are wide open.

To train the system, Alibaba built a massive audio-visual dataset inspired by film and TV production. That’s why it works so well across different formats, from vertical short clips to cinematic widescreen videos.

If you want to try it out, Wan2.2-S2V is already up on Hugging Face, GitHub, and Alibaba’s ModelScope community.
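
For anyone who prefers scripting the setup, here's a minimal sketch that pulls the model weights down from Hugging Face using the huggingface_hub library. The repo id below is an assumption based on the release naming, so verify it against the official model card before running:

```python
# Minimal sketch: download the Wan2.2-S2V checkpoint from Hugging Face.
# The repo id "Wan-AI/Wan2.2-S2V-14B" is an assumption -- check the
# official model card for the exact name.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Wan-AI/Wan2.2-S2V-14B",  # assumed repo id
    local_dir="./Wan2.2-S2V-14B",     # where the weights land locally
)
print(f"Weights saved to {local_dir}")
```

From there, inference is run through the scripts in the Wan2.2 GitHub repository, which, as the article describes, take a reference image, an audio clip, and an optional text prompt as inputs.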
