Version History¶

This page summarizes all VoxCPM releases, including feature comparison, version highlights, and migration guidance.

Quick Comparison¶

Feature	VoxCPM 1.0	VoxCPM 1.5	VoxCPM 2
Parameters	640M	800M	2B
Audio Output	16kHz	44.1kHz	48kHz
Languages	2 (zh, en)	2 (zh, en)	30
Patch Size	2	4	4
LM Token Rate	12.5Hz	6.25Hz	6.25Hz
Max Sequence Length	4096	4096	8192
Residual LM Fusion	Additive	Additive	Concat + Projection
DiT Conditioning	Single token (add)	Single token (add)	Multi-token (concat)
Reference Audio	Prompt continuation	Prompt continuation	Isolated ref channel
Voice Design	—	—	✅
Style Control	—	—	✅
SFT / LoRA	✅	✅	✅
RTF (RTX 4090)	~0.17	~0.15	~0.3

For a detailed explanation of the architecture components (four-stage pipeline, AudioVAE, Local DiT), see Architecture.

VoxCPM 2¶

VoxCPM 2 is the latest major release — a 2B parameter model trained on 2.36 million hours of multilingual data. It represents a significant leap in capacity, quality, and controllability over the 1.x series.

Key characteristics:

48kHz audio output via AudioVAE V2 (asymmetric 16kHz encode → 48kHz decode)
30-language multilingual support
Voice Design: create a voice from natural-language description, no reference audio needed
Style Control: control emotion, pace, and speaking style of a cloned voice via text tags
Isolated reference channel for voice cloning (no matching transcript required)
Concat-Projection residual LM fusion and multi-token DiT conditioning for richer expressiveness
Built on a MiniCPM-4 backbone

Use VoxCPM 2 for all new projects. It is the recommended default for multilingual synthesis, voice cloning, voice design, and production deployment.

VoxCPM 1.5¶

VoxCPM 1.5 is the final 1.x upgrade before VoxCPM 2. It improves audio quality and efficiency while keeping the core context-aware generation and zero-shot voice cloning workflow familiar to existing 1.x users.

Key characteristics:

44.1kHz output
6.25Hz LM token rate
patch size increased from 2 to 4
simpler migration path for existing VoxCPM 1.0 users

Use VoxCPM 1.5 when you want a lighter Chinese/English checkpoint than VoxCPM 2, while keeping stronger output quality than VoxCPM 1.0.

VoxCPM 1.0¶

VoxCPM 1.0 is the original tokenizer-free VoxCPM release. It remains useful as the baseline reference point for the family and for older experiments built around the original 0.5B checkpoint.

Key characteristics:

600M parameter size
16kHz output
original VoxCPM architecture release
benchmark reference for early VoxCPM results

Use VoxCPM 1.0 when you need the smallest historical checkpoint or want to compare against the original baseline behavior.

Migration Guidance¶

New projects should start with VoxCPM 2.
Existing VoxCPM 1.0 users should generally move to VoxCPM 1.5 first if they need a lower-risk 1.x upgrade path.
If you need multilingual synthesis, Voice Design, Style Control, or 48kHz output, move directly to VoxCPM 2.

Detailed Pages¶

Full VoxCPM 2 page: VoxCPM 2
Full VoxCPM 1.5 page: VoxCPM 1.5
Full VoxCPM 1.0 page: VoxCPM 1.0