HunyuanVideo-Avatar is a multimodal diffusion transformer (MM-DiT) model from Tencent Hunyuan that animates static avatar images into dynamic, emotion-controllable, multi-character dialogue videos conditioned on audio. It targets three challenges: motion realism, identity consistency, and emotional alignment between audio and video. Its main innovations are a character image injection module, an Audio Emotion Module that transfers emotion cues into the generated video, and a Face-Aware Audio Adapter that confines each audio stream's effect to a face region so that multiple characters in one scene can be animated independently.

Features

  • Animates avatars in multiple styles (photorealistic, cartoon, rendered, anthropomorphic) with dynamic motion and backgrounds, driven by audio
  • Emotion control: extracts emotional cues from an emotion reference image and transfers that emotional style to the generated video
  • Multi-character capability: supports more than one avatar in dialogue scenarios
  • Character image injection module for better consistency between training and inference conditioning
  • Face-Aware Audio Adapter (FAA) isolates audio effects through a latent face mask, enabling cross-attention control of multiple characters
  • Resource requirements scale with output size: the project states minimum and recommended GPU memory, and supports variable resolutions and frame lengths
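
The face-mask idea behind the Face-Aware Audio Adapter can be illustrated with a toy example. The sketch below is not the project's code: the function names, the single-head attention without learned projections, and the residual update are all assumptions chosen for illustration. It shows only the general principle that each character's audio features update the video latent tokens inside that character's face mask and nowhere else.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (hypothetical, no learned weights)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def face_masked_audio_update(latents, face_masks, audio_tracks):
    """Add audio-attended features only to tokens inside each character's mask.

    latents:      list of N latent token vectors
    face_masks:   one list of N 0/1 flags per character (assumed given)
    audio_tracks: one list of audio feature vectors per character
    """
    updated = [list(token) for token in latents]
    for mask, audio in zip(face_masks, audio_tracks):
        attended = cross_attention(latents, audio, audio)
        for i, inside_face in enumerate(mask):
            if inside_face:
                updated[i] = [x + a for x, a in zip(updated[i], attended[i])]
    return updated


# Two characters, four latent tokens: token 0 is face A, token 2 is face B.
latents = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
face_masks = [[1, 0, 0, 0], [0, 0, 1, 0]]
audio_tracks = [[[1.0, 0.0]], [[0.0, 2.0]]]
result = face_masked_audio_update(latents, face_masks, audio_tracks)
```

In this toy setup, tokens outside both masks (for example, background) pass through unchanged, which is why one speaker's audio does not move the other character's face.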
