Chat Utilities and Message Processing

This page documents vLLM's chat message utilities, which are responsible for parsing, validating, and processing chat conversations. These utilities handle the transformation of OpenAI-style message structures into internal formats, multimodal content handling, and the application of Jinja2 chat templates via the Renderer system.

Overview of Message Processing Flow

A chat request moves through a pipeline from high-level API objects to low-level engine inputs. The primary entry points for this logic are parse_chat_messages and its asynchronous counterpart, parse_chat_messages_async.

Title: Chat Processing Pipeline

Sources:

vllm/entrypoints/chat_utils.py (lines 1008-1150)
vllm/renderers/hf.py (lines 22-28)
vllm/renderers/base.py (lines 51-60)

Multimodal Content Handling

vLLM supports complex multimodal inputs including images, audio, and video within chat messages. The system maps these inputs to specific internal placeholders to ensure they are correctly positioned during template rendering.

Modality Placeholders

Internal placeholders are used to mark the position of multimodal data in the text stream before the final prompt is constructed:

image: <##IMAGE##>
audio: <##AUDIO##>
video: <##VIDEO##>
prompt_embeds: <##PROMPT_EMBEDS##>

Sources:

vllm/entrypoints/chat_utils.py (lines 83-88)
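
To make the placeholder scheme concrete, here is a minimal sketch of such a mapping. The marker strings are copied from the list above. The names MODALITY_PLACEHOLDERS, placeholder_for, and mark_positions are hypothetical and are not taken from vLLM.

```python
# Hypothetical mirror of the placeholder list above; the real mapping
# lives in vllm/entrypoints/chat_utils.py and may be structured differently.
MODALITY_PLACEHOLDERS = {
    "image": "<##IMAGE##>",
    "audio": "<##AUDIO##>",
    "video": "<##VIDEO##>",
    "prompt_embeds": "<##PROMPT_EMBEDS##>",
}


def placeholder_for(modality: str) -> str:
    """Return the internal marker for a modality, or raise for unknown ones."""
    try:
        return MODALITY_PLACEHOLDERS[modality]
    except KeyError:
        raise ValueError(f"Unsupported modality: {modality!r}") from None


def mark_positions(parts: list[tuple[str, str]]) -> str:
    """Join (kind, value) parts, swapping every non-text part for its marker."""
    pieces = []
    for kind, value in parts:
        pieces.append(value if kind == "text" else placeholder_for(kind))
    return "\n".join(pieces)
```

Calling mark_positions([("text", "Describe:"), ("image", "cat.png")]) keeps the text and records where the image belongs, so a later stage can substitute the model-specific token at that position.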

MultiModalDataDict and UUIDs

Processed multimodal data is stored in a MultiModalDataDict, which maps modalities to lists of data such as PIL.Image objects, tensors, or numpy arrays (vllm/inputs/__init__.py, line 43). Additionally, MultiModalUUIDDict tracks unique identifiers for media items to support advanced caching and reference mechanisms (vllm/entrypoints/chat_utils.py, line 43).

Title: Chat Data Structures

Sources:

vllm/entrypoints/chat_utils.py (line 43)
vllm/entrypoints/chat_utils.py (lines 981-1000)
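
The relationship between the two dictionaries can be illustrated with plain Python containers. This is only a sketch: collect_media is a hypothetical helper, and the real MultiModalDataDict and MultiModalUUIDDict are typed structures whose exact fields are not shown on this page.

```python
from collections import defaultdict
from typing import Any, Optional


def collect_media(items: list[tuple[str, Any, Optional[str]]]):
    """Group (modality, data, uuid) triples into two parallel dicts.

    Index i of mm_data[modality] and mm_uuids[modality] describe the
    same media item, which is what makes UUID-based lookup possible.
    """
    mm_data: dict[str, list[Any]] = defaultdict(list)
    mm_uuids: dict[str, list[Optional[str]]] = defaultdict(list)
    for modality, data, uuid in items:
        mm_data[modality].append(data)
        mm_uuids[modality].append(uuid)  # None when the caller gave no ID
    return dict(mm_data), dict(mm_uuids)
```

Keeping the UUID list aligned with the data list means a cache can be keyed on the identifier without inspecting the payload itself.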

Multimodal Input Constraints

The system enforces limits on multimodal items per prompt via ModelConfig.limit_mm_per_prompt (vllm/entrypoints/chat_utils.py, lines 1085-1090). If a request exceeds these limits, a VLLMValidationError is raised during the parsing stage (vllm/entrypoints/chat_utils.py, lines 1091-1094).

Sources:

vllm/entrypoints/chat_utils.py (lines 1085-1094)
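
A per-modality limit check of this kind might look like the sketch below. MMLimitError is a stand-in for VLLMValidationError so the example runs without vLLM installed. check_mm_limits and the default_limit fallback are assumptions made for illustration.

```python
class MMLimitError(ValueError):
    """Stand-in for VLLMValidationError; defined here so the sketch runs alone."""


def check_mm_limits(
    counts: dict[str, int],
    limits: dict[str, int],
    default_limit: int = 1,  # assumed fallback for modalities without a limit
) -> None:
    """Raise if any modality appears more often than its configured limit."""
    for modality, count in counts.items():
        allowed = limits.get(modality, default_limit)
        if count > allowed:
            raise MMLimitError(
                f"At most {allowed} {modality} item(s) may be provided "
                f"in one prompt, but {count} were given."
            )
```

Running the check during parsing, before any media is processed further, lets an oversized request fail early with a clear message.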

Chat Template Processing

vLLM uses Jinja2 templates to convert a list of ConversationMessage objects into a single string prompt. This process is managed by the BaseRenderer and specialized subclasses.

Template Resolution

The resolve_chat_template function determines which template to use based on configuration. vLLM requires a chat template in the tokenizer configuration to support chat protocols (vllm/entrypoints/chat_utils.py, lines 75-80).
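
One plausible shape for this resolution step is a simple precedence chain. The sketch below assumes that an explicitly supplied template wins over the tokenizer's own template and that a missing template is an error. That ordering, and the name pick_chat_template, are illustrative assumptions rather than the behaviour of resolve_chat_template itself.

```python
from typing import Optional


def pick_chat_template(
    explicit_template: Optional[str],
    tokenizer_template: Optional[str],
) -> str:
    """Choose a chat template, preferring an explicit one (assumed order)."""
    if explicit_template is not None:
        return explicit_template
    if tokenizer_template is not None:
        return tokenizer_template
    # Without any template there is no way to render a conversation.
    raise ValueError(
        "No chat template available: supply one explicitly or use a "
        "tokenizer whose configuration defines a chat template."
    )
```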

Specialized Renderers

Certain models require logic beyond standard Jinja2 templates or have specific tokenization requirements. vLLM implements specialized renderers:

DeepseekV32Renderer: Handles specific encoding and template application for DeepSeek V3.2 (vllm/renderers/deepseek_v32.py, lines 23-63)
Grok2Renderer: Implements custom rendering for Grok-2 (vllm/renderers/grok2.py, lines 23-63)
HfRenderer: The default renderer for most HuggingFace-compatible models, leveraging jinja2 for template logic (vllm/renderers/hf.py, line 51)
MistralRenderer: Implements specialized rendering for Mistral models (vllm/renderers/mistral.py, lines 1-50)
TerratorchRenderer: Specialized renderer for geospatial Terratorch models (vllm/renderers/terratorch.py, lines 1-40)

Sources:

vllm/renderers/hf.py (line 51)
vllm/renderers/registry.py (lines 1-50)
vllm/renderers/mistral.py (lines 1-50)

Implementation of parse_chat_messages

The parse_chat_messages function contains the core logic for converting OpenAI ChatCompletionMessageParam objects into vLLM-compatible structures.

Execution Flow

Validation: Checks that the input messages conform to the expected types and roles (vllm/entrypoints/chat_utils.py, lines 1008-1015)
Content Extraction: Iterates through message parts. If a part is an image_url, video_url, or audio_url, it is fetched or parsed and added to the MultiModalDataDict (vllm/entrypoints/chat_utils.py, lines 1020-1060)
Placeholder Replacement: Replaces multimodal parts with internal text placeholders based on the model's requirements (vllm/entrypoints/chat_utils.py, lines 83-88)
Interleaving: If interleave_mm_strings is enabled in ModelConfig, placeholders keep their position relative to the surrounding text content rather than being grouped together apart from it (vllm/entrypoints/chat_utils.py, lines 1115-1130)

Sources:

vllm/entrypoints/chat_utils.py (lines 1008-1150)
vllm/entrypoints/chat_utils.py (lines 1152-1200)
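
The four steps can be sketched end to end for a single message. This is a simplified stand-in, not the real parse_chat_messages: parse_message, VALID_ROLES, and PART_PLACEHOLDERS are hypothetical names, and media URLs are only recorded rather than fetched. In grouped (non-interleaved) mode the sketch puts the markers before the text, which is an assumption; the real placement may differ.

```python
from typing import Any

# Hypothetical tables for this sketch only.
VALID_ROLES = {"system", "user", "assistant", "tool"}
PART_PLACEHOLDERS = {
    "image_url": "<##IMAGE##>",
    "audio_url": "<##AUDIO##>",
    "video_url": "<##VIDEO##>",
}


def parse_message(
    message: dict[str, Any], interleave: bool
) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Return (conversation_message, media_by_part_type) for one message."""
    # 1. Validation
    role = message.get("role")
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    content = message["content"]
    if isinstance(content, str):
        return {"role": role, "content": content}, {}

    texts: list[str] = []    # text parts only
    markers: list[str] = []  # placeholders only
    ordered: list[str] = []  # both, in original order
    media: dict[str, list[str]] = {}
    for part in content:
        kind = part["type"]
        if kind == "text":
            texts.append(part["text"])
            ordered.append(part["text"])
        elif kind in PART_PLACEHOLDERS:
            # 2. Content extraction (URL recorded, not downloaded)
            media.setdefault(kind, []).append(part[kind]["url"])
            # 3. Placeholder replacement
            markers.append(PART_PLACEHOLDERS[kind])
            ordered.append(PART_PLACEHOLDERS[kind])
        else:
            raise ValueError(f"Unsupported content part type: {kind!r}")

    # 4. Interleaving: keep original order, or group markers (assumed: first)
    pieces = ordered if interleave else markers + texts
    return {"role": role, "content": "\n".join(pieces)}, media
```

For a message with the parts text "Before", an image, and text "After", interleave=True yields "Before\n<##IMAGE##>\nAfter", while interleave=False yields "<##IMAGE##>\nBefore\nAfter" in this sketch.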

Renderer Architecture

The BaseRenderer class provides the foundation for transforming chat messages into engine-ready inputs. It manages a ThreadPoolExecutor to offload blocking tokenization and multimodal preprocessing (vllm/renderers/base.py, lines 88-93).

Title: Renderer System Architecture

Key Components:

BaseRenderer: Abstract base class for all message renderers (vllm/renderers/base.py, lines 74-90)
AsyncMicrobatchTokenizer: Handles tokenization in the background to avoid blocking the event loop (vllm/renderers/base.py, lines 146-152)
BaseMultiModalProcessor: Processes raw media into tensors suitable for model input (vllm/renderers/base.py, lines 98-106)
ChatParams: Data class containing configuration to control how chat messages are parsed (vllm/renderers/params.py, lines 72-126)

Sources:

vllm/renderers/base.py (lines 74-175)
vllm/renderers/params.py (lines 72-126)
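
The executor-offload pattern described in this section can be shown with the standard library alone. SketchRenderer and blocking_tokenize are hypothetical stand-ins. The real BaseRenderer has a richer interface, and the toy tokenizer here only mimics a blocking call.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def blocking_tokenize(text: str) -> list[int]:
    """Toy tokenizer standing in for a real, blocking one."""
    return [len(word) for word in text.split()]


class SketchRenderer:
    """Owns a thread pool and runs blocking work off the event loop."""

    def __init__(self, max_workers: int = 1) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)

    async def tokenize_async(self, text: str) -> list[int]:
        loop = asyncio.get_running_loop()
        # The event loop stays free while the worker thread tokenizes.
        return await loop.run_in_executor(self._executor, blocking_tokenize, text)

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)


async def main() -> list[list[int]]:
    renderer = SketchRenderer()
    try:
        # gather() returns results in the order the awaitables were passed.
        return await asyncio.gather(
            renderer.tokenize_async("hello world"),
            renderer.tokenize_async("a bc def"),
        )
    finally:
        renderer.shutdown()
```

Running asyncio.run(main()) tokenizes both strings on the worker thread while the coroutine that issued them remains responsive to other requests.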

Tool Calling and Reasoning Parsing

vLLM integrates ToolParser and ReasoningParser into a unified Parser interface to handle structured model outputs during chat.

GPT-OSS Reasoning

For models like DeepSeek-R1 or GPT-OSS, vLLM provides specialized reasoning parsers (e.g., GPTOSSReasoningParser) to extract structured thought blocks from the raw output stream (vllm/reasoning/gptoss_reasoning_parser.py, lines 1-50).

Title: Output Parsing System

Sources:

vllm/entrypoints/openai/chat_completion/serving.py (lines 125-140)
vllm/reasoning/gptoss_reasoning_parser.py (lines 1-100)
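
To illustrate what extracting a thought block involves, the sketch below splits a completed output on <think>...</think> delimiters, the convention used by DeepSeek-R1-style models. It is not the GPTOSSReasoningParser. Other models, GPT-OSS included, may delimit reasoning differently, and split_reasoning is a hypothetical helper that ignores streaming.

```python
from typing import Optional


def split_reasoning(
    output: str,
    start: str = "<think>",  # assumed delimiter, model-specific in practice
    end: str = "</think>",
) -> tuple[Optional[str], str]:
    """Return (reasoning, answer); reasoning is None if no block is found."""
    before, found_start, rest = output.partition(start)
    if not found_start:
        return None, output
    reasoning, found_end, after = rest.partition(end)
    if not found_end:
        # Unterminated block: treat everything after the start tag as reasoning.
        return reasoning.strip(), before.strip()
    return reasoning.strip(), (before + after).strip()
```

A production parser also has to handle partial delimiters arriving across streamed chunks, which this sketch deliberately leaves out.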

Prompt Embeddings Handling

vLLM supports passing raw prompt_embeds directly through chat messages. This requires setting the --enable-prompt-embeds flag (vllm/entrypoints/chat_utils.py, lines 105-107).

Processing Logic

Sentinel Insertion: The renderer inserts a PROMPT_EMBEDS_PLACEHOLDER_TOKEN into the prompt string (vllm/renderers/hf.py, lines 102-126)
Expansion: The 1-token sentinel is expanded into an N-token span matching the embedding tensor's sequence length (vllm/renderers/hf.py, lines 149-161)
Masking: A boolean mask is_token_ids is generated to distinguish between text tokens and embedded regions (vllm/renderers/hf.py, lines 191-203)

Sources:

vllm/renderers/hf.py (lines 102-210)
vllm/entrypoints/chat_utils.py (lines 167-177)
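
The expansion and masking steps can be sketched on plain lists of token IDs. SENTINEL_ID and expand_sentinels are hypothetical: the real renderer works with an actual placeholder token and embedding tensors, whereas this sketch only needs each tensor's sequence length.

```python
SENTINEL_ID = -1  # hypothetical ID standing in for the placeholder token


def expand_sentinels(
    token_ids: list[int],
    embed_lengths: list[int],
    sentinel_id: int = SENTINEL_ID,
) -> tuple[list[int], list[bool]]:
    """Expand each sentinel to its embedding length and build the mask.

    The mask is True where a position holds a real vocabulary token and
    False where it will be filled from an embedding tensor.
    """
    expanded: list[int] = []
    is_token_ids: list[bool] = []
    lengths = iter(embed_lengths)
    for token in token_ids:
        if token == sentinel_id:
            try:
                n = next(lengths)
            except StopIteration:
                raise ValueError("More sentinels than embedding tensors") from None
            expanded.extend([sentinel_id] * n)
            is_token_ids.extend([False] * n)
        else:
            expanded.append(token)
            is_token_ids.append(True)
    if next(lengths, None) is not None:
        raise ValueError("More embedding tensors than sentinels")
    return expanded, is_token_ids
```

With token_ids [10, -1, 11] and one embedding of length 3, the result is five positions long and the mask is True only at the first and last positions.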

Input Batching for GPU Execution

Once processed by the renderers, inputs are eventually batched for the GPU. The InputBatch and CachedRequestState objects in the V1 engine track the resulting tokens and multimodal features (vllm/v1/worker/gpu_input_batch.py, lines 34-90).

Key Fields:

prompt_is_token_ids: Aligned mask indicating whether a position is a vocabulary token or an embedding (vllm/v1/worker/gpu_input_batch.py, lines 57-58)
mm_features: List of multimodal feature specifications for the model runner (vllm/v1/worker/gpu_input_batch.py, line 37)

Sources:

vllm/v1/worker/gpu_input_batch.py (lines 34-181)

Summary of Chat Utilities Functions

| Function | Purpose | File |
| --- | --- | --- |
| `parse_chat_messages` | Synchronous parsing of messages and multimodal data. | `vllm/entrypoints/chat_utils.py` (lines 1008-1150) |
| `parse_chat_messages_async` | Asynchronous parsing, supporting remote media fetching. | `vllm/entrypoints/chat_utils.py` (lines 1152-1200) |
| `resolve_chat_template` | Loads and resolves Jinja2 templates for the model. | `vllm/entrypoints/chat_utils.py` (line 26) |

Sources:

vllm/entrypoints/chat_utils.py (lines 1008-1200)
