Cartesia Sonic-3.5
Sonic 3.5 is Cartesia’s fastest, most natural text-to-speech model, built for expressive, real-time voice generation with sub-90ms latency and native support for 42 languages. It is designed to follow transcripts faithfully, voice confirmation codes and heteronyms correctly without preprocessing, and stay expressive enough to carry a real conversation. Every supported language is intended to deliver native-quality speech. Sonic 3.5 focuses on clean audio across every language and voice, with no artifacts to edit out, making it practical for production voice experiences where quality, speed, and consistency matter. Its conversational delivery provides strong pacing and real emotional range, tuned for support and agent transcripts. Alphanumerics such as order numbers, phone numbers, IDs, and emails are spoken naturally in every language, while context-aware English pronunciation helps heteronyms like read, bass, and bow land correctly based on the surrounding text.
TML-Interaction-Small
TML-Interaction-Small is a real-time multimodal interaction model developed by Thinking Machines Lab to enable more natural and collaborative human-AI communication across audio, video, and text. Unlike traditional turn-based AI systems that rely on external scaffolding and delayed interactions, TML-Interaction-Small is designed around continuous micro-turn exchanges that allow the model to perceive, respond, listen, speak, and react simultaneously in real time. The model uses a time-aware architecture that processes 200ms interaction windows, enabling seamless interruptions, simultaneous speech, visual cue detection, and live collaborative workflows without requiring separate dialog management systems. TML-Interaction-Small supports capabilities such as real-time conversation, proactive interjections, live translation, visual monitoring, tool usage, browsing, and asynchronous reasoning through coordination with a background model.
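As a rough illustration of what a 200ms window means in terms of audio data, the sketch below slices a buffer of samples into consecutive 200ms chunks. It is not Thinking Machines Lab's implementation: the function name `split_into_windows`, the 16 kHz sample rate, and the list-of-integers sample representation are all assumptions made for this example.

```python
from typing import Iterator, List

WINDOW_MS = 200  # window length stated in the model description


def split_into_windows(samples: List[int], sample_rate_hz: int,
                       window_ms: int = WINDOW_MS) -> Iterator[List[int]]:
    """Yield consecutive fixed-duration windows from a mono sample buffer.

    Hypothetical helper for illustration only; the final window may be
    shorter than the others if the buffer does not divide evenly.
    """
    samples_per_window = sample_rate_hz * window_ms // 1000
    if samples_per_window <= 0:
        raise ValueError("window too short for this sample rate")
    for start in range(0, len(samples), samples_per_window):
        yield samples[start:start + samples_per_window]


# One second of silence at an assumed 16 kHz rate splits into five windows.
windows = list(split_into_windows([0] * 16000, sample_rate_hz=16000))
```

At 16 kHz each full window holds 3,200 samples, so a system working at this granularity would receive a new chunk of input five times per second.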
Cartesia Sonic-3
Cartesia Sonic-3 is a real-time, streaming text-to-speech (TTS) model designed to generate ultra-realistic, expressive voice output with extremely low latency, enabling AI systems to speak as fluidly as humans in live interactions. Built on a state space model architecture, Sonic-3 delivers high-quality speech with near-instant response times: audio generation begins within 40–100 milliseconds, so conversations feel seamless rather than delayed. It is optimized for conversational AI use cases, acting as the “voice layer” for AI agents by converting text into natural-sounding speech that carries emotional nuance such as excitement, empathy, or even laughter. It supports more than 40 languages with native-level voices and accent localization, allowing developers to build globally accessible applications with consistent quality across regions.
GPT-Realtime-2
GPT-Realtime-2 is OpenAI’s voice model for live interactions where the model can keep the conversation moving while it reasons through requests, calls tools, handles corrections or interruptions, and responds in a way that fits the moment. It is built for a new class of voice apps that feel more natural, respond more intelligently, and take action in real time. GPT-Realtime-2 brings GPT-5-class reasoning to voice experiences, helping agents understand what someone means, track context, recover when a request changes, use tools while the conversation continues, and carry the conversation forward naturally. Developers can enable short preambles like “let me check that” so users know the agent is working, and the model can call multiple tools at once while making actions audible with phrases like “checking your calendar” or “looking that up now.” It also offers stronger recovery behavior, longer context for agentic workflows, and better retention of specialized terminology.