This page documents vLLM's chat message utilities, which are responsible for parsing, validating, and processing chat conversations. These utilities handle the transformation of OpenAI-style message structures into internal formats, multimodal content handling, and the application of Jinja2 chat templates via the Renderer system.
The processing of a chat request follows a pipeline that moves from high-level API objects to low-level engine inputs. The primary entry points for this logic are parse_chat_messages and its asynchronous counterpart.
Title: Chat Processing Pipeline
Sources:
- vllm/entrypoints/chat_utils.py:1008-1150
- vllm/renderers/hf.py:22-28
- vllm/renderers/base.py:51-60

vLLM supports complex multimodal inputs, including images, audio, and video, within chat messages. The system maps these inputs to specific internal placeholders so that they are correctly positioned during template rendering.
Internal placeholders are used to mark the position of multimodal data in the text stream before the final prompt is constructed:
- image: <##IMAGE##>
- audio: <##AUDIO##>
- video: <##VIDEO##>
- prompt_embeds: <##PROMPT_EMBEDS##>

Sources:
- vllm/entrypoints/chat_utils.py:83-88

Processed multimodal data is stored in a MultiModalDataDict, which maps each modality to a list of data items such as PIL.Image objects, tensors, or numpy arrays (vllm/inputs/__init__.py:43). Additionally, MultiModalUUIDDict tracks unique identifiers for media items to support advanced caching and reference mechanisms (vllm/entrypoints/chat_utils.py:43).
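To make the placeholder mechanism concrete, here is a minimal sketch rather than vLLM's actual implementation. The placeholder strings are the ones listed above; the `collect_parts` helper and the simplified content-part dictionaries are hypothetical, and the media payloads are stand-in file names instead of decoded images or tensors.

```python
from collections import defaultdict
from typing import Any

# Placeholder strings as listed above; everything else here is illustrative.
PLACEHOLDERS = {
    "image": "<##IMAGE##>",
    "audio": "<##AUDIO##>",
    "video": "<##VIDEO##>",
    "prompt_embeds": "<##PROMPT_EMBEDS##>",
}


def collect_parts(parts: list[dict[str, Any]]) -> tuple[str, dict[str, list[Any]]]:
    """Swap media parts for placeholders and gather their payloads by modality."""
    text_chunks: list[str] = []
    mm_data: dict[str, list[Any]] = defaultdict(list)
    for part in parts:
        if part["type"] == "text":
            text_chunks.append(part["text"])
        else:
            modality = part["type"]
            # An unknown modality raises KeyError here.
            text_chunks.append(PLACEHOLDERS[modality])
            mm_data[modality].append(part["data"])
    return "\n".join(text_chunks), dict(mm_data)


prompt_text, mm_data = collect_parts([
    {"type": "text", "text": "Compare these two pictures."},
    {"type": "image", "data": "first.png"},
    {"type": "image", "data": "second.png"},
])
```

In this sketch the resulting dictionary has one key per modality that actually appeared, with its items kept in the order they were encountered, which mirrors the "modality to list of data" shape described above.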
Title: Chat Data Structures
Sources:
- vllm/entrypoints/chat_utils.py:43
- vllm/entrypoints/chat_utils.py:981-1000

The system enforces limits on multimodal items per prompt via ModelConfig.limit_mm_per_prompt (vllm/entrypoints/chat_utils.py:1085-1090). If a request exceeds these limits, a VLLMValidationError is raised during the parsing stage (vllm/entrypoints/chat_utils.py:1091-1094).
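The limit check can be pictured with a short sketch. This is an illustration under stated assumptions, not the code in chat_utils.py: `MultiModalLimitError` is a hypothetical stand-in for the validation error, `check_mm_limits` is an invented helper, and the fallback limit of 1 for unlisted modalities is an assumption.

```python
from collections import Counter


class MultiModalLimitError(ValueError):
    """Hypothetical stand-in for the validation error raised during parsing."""


def check_mm_limits(
    modalities: list[str],
    limits: dict[str, int],
    default_limit: int = 1,  # assumed fallback for modalities without an entry
) -> None:
    """Raise if any modality appears more often than its configured limit."""
    counts = Counter(modalities)
    for modality, count in counts.items():
        allowed = limits.get(modality, default_limit)
        if count > allowed:
            raise MultiModalLimitError(
                f"At most {allowed} {modality} item(s) may be provided "
                f"in one prompt, got {count}."
            )


limits = {"image": 2, "audio": 1}
check_mm_limits(["image", "image", "audio"], limits)  # within limits, no error
```

Counting before any media is fetched or decoded lets a request that exceeds the limit fail early, which matches the description of the error being raised during the parsing stage.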
Sources:
- vllm/entrypoints/chat_utils.py:1085-1094

vLLM uses Jinja2 templates to convert a list of ConversationMessage objects into a single string prompt. This process is managed by the BaseRenderer and its specialized subclasses.
The resolve_chat_template function determines which template to use based on configuration. vLLM requires a chat template in the tokenizer configuration to support chat protocols (vllm/entrypoints/chat_utils.py:75-80).
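As a rough illustration of template resolution, the sketch below picks between two candidate templates. It does not reproduce the signature or behavior of resolve_chat_template: the `pick_chat_template` helper is hypothetical, and the precedence it uses (an explicitly supplied template wins over the tokenizer's own) is an assumption made for the example.

```python
from typing import Optional


def pick_chat_template(
    explicit_template: Optional[str],
    tokenizer_template: Optional[str],
) -> str:
    """Return the template to render with, or fail if none is available."""
    # Assumed precedence: a template passed with the request or at startup
    # overrides the one shipped in the tokenizer configuration.
    if explicit_template is not None:
        return explicit_template
    if tokenizer_template is not None:
        return tokenizer_template
    raise ValueError(
        "No chat template available: the tokenizer configuration does not "
        "define one and none was supplied explicitly."
    )


template = pick_chat_template(None, "{{ messages[0]['content'] }}")
```

Failing loudly when neither source provides a template reflects the requirement stated above that a chat template must exist for chat protocols to work.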
Certain models require logic beyond standard Jinja2 templates or have specific tokenization requirements. vLLM implements specialized renderers:
- DeepseekV32Renderer: Handles specific encoding and template application for DeepSeek V3.2 (vllm/renderers/deepseek_v32.py:23-63).
- Grok2Renderer: Implements custom rendering for Grok-2 (vllm/renderers/grok2.py:23-63).
- HfRenderer: The default renderer for most HuggingFace-compatible models, leveraging jinja2 for template logic (vllm/renderers/hf.py:51).
- MistralRenderer: Implements specialized rendering for Mistral models (vllm/renderers/mistral.py:1-50).
- TerratorchRenderer: Specialized renderer for geospatial Terratorch models (vllm/renderers/terratorch.py:1-40).

Sources:
- vllm/renderers/hf.py:51
- vllm/renderers/registry.py:1-50
- vllm/renderers/mistral.py:1-50

The parse_chat_messages function is the core logic for converting OpenAI ChatCompletionMessageParam objects into vLLM-compatible structures.
- If a content part is of type image_url, video_url, or audio_url, it is fetched or parsed and added to the MultiModalDataDict (vllm/entrypoints/chat_utils.py:1020-1060).
- If interleave_mm_strings is enabled in ModelConfig, placeholders are positioned relative to the text content rather than appended at the end (vllm/entrypoints/chat_utils.py:1115-1130).

Sources:
- vllm/entrypoints/chat_utils.py:1008-1150
- vllm/entrypoints/chat_utils.py:1152-1200

The BaseRenderer class provides the foundation for transforming chat messages into engine-ready inputs. It manages a ThreadPoolExecutor to offload blocking tokenization and multimodal preprocessing (vllm/renderers/base.py:88-93).
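The offloading pattern mentioned for BaseRenderer can be sketched with the standard library alone. This is a generic asyncio example, not code from vllm/renderers/base.py: `slow_tokenize` is a toy stand-in for a blocking tokenizer, and `tokenize_async` is a hypothetical helper name.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def slow_tokenize(text: str) -> list[int]:
    """Toy stand-in for a blocking tokenizer: one code point per character."""
    return [ord(ch) for ch in text]


async def tokenize_async(executor: ThreadPoolExecutor, text: str) -> list[int]:
    """Run the blocking call on a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, slow_tokenize, text)


async def main() -> list[list[int]]:
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Both requests are submitted before either result is awaited.
        return await asyncio.gather(
            tokenize_async(executor, "hi"),
            tokenize_async(executor, "ok"),
        )


results = asyncio.run(main())
```

The point of the pattern is that request handling coroutines keep making progress while CPU-bound or I/O-bound preprocessing runs on worker threads.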
Title: Renderer System Architecture
Key Components:
- BaseRenderer: Abstract base class for all message renderers (vllm/renderers/base.py:74-90).
- AsyncMicrobatchTokenizer: Handles tokenization in the background to avoid blocking the event loop (vllm/renderers/base.py:146-152).
- BaseMultiModalProcessor: Processes raw media into tensors suitable for model input (vllm/renderers/base.py:98-106).
- ChatParams: Data class containing configuration that controls how chat messages are parsed (vllm/renderers/params.py:72-126).

Sources:
- vllm/renderers/base.py:74-175
- vllm/renderers/params.py:72-126

vLLM integrates ToolParser and ReasoningParser into a unified Parser interface to handle structured model outputs during chat.
For models like DeepSeek-R1 or GPT-OSS, vLLM provides specialized reasoning parsers (e.g., GPTOSSReasoningParser) to extract structured thought blocks from the raw output stream (vllm/reasoning/gptoss_reasoning_parser.py:1-50).
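The idea of separating a thought block from the final answer can be shown with a small sketch. It is not the ReasoningParser interface and does not handle streaming: `split_reasoning` is a hypothetical helper, and the `<think>`/`</think>` delimiters are an assumption modeled on DeepSeek-R1-style output (other models, including GPT-OSS, use different formats).

```python
from typing import Optional


def split_reasoning(
    output: str,
    start_tag: str = "<think>",  # assumed delimiter, not universal
    end_tag: str = "</think>",
) -> tuple[Optional[str], str]:
    """Return (reasoning, content); reasoning is None if no closed block exists."""
    start = output.find(start_tag)
    if start == -1:
        return None, output
    end = output.find(end_tag, start + len(start_tag))
    if end == -1:
        return None, output
    reasoning = output[start + len(start_tag):end].strip()
    # Everything outside the delimited block is treated as the visible answer.
    content = (output[:start] + output[end + len(end_tag):]).strip()
    return reasoning, content


reasoning, content = split_reasoning(
    "<think>2 + 2 is 4.</think>The answer is 4."
)
```

A production parser additionally has to cope with partial delimiters arriving across streamed chunks, which this whole-string version deliberately ignores.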
Title: Output Parsing System
Sources:
- vllm/entrypoints/openai/chat_completion/serving.py:125-140
- vllm/reasoning/gptoss_reasoning_parser.py:1-100

vLLM supports passing raw prompt_embeds directly through chat messages. This requires setting --enable-prompt-embeds (vllm/entrypoints/chat_utils.py:105-107).
- The renderer inserts PROMPT_EMBEDS_PLACEHOLDER_TOKEN into the prompt string (vllm/renderers/hf.py:102-126).
- An is_token_ids mask is generated to distinguish between text tokens and embedded regions (vllm/renderers/hf.py:191-203).

Sources:
- vllm/renderers/hf.py:102-210
- vllm/entrypoints/chat_utils.py:167-177

Once processed by the renderers, inputs are eventually batched for the GPU. The InputBatch and CachedRequestState objects in the V1 engine track the resulting tokens and multimodal features (vllm/v1/worker/gpu_input_batch.py:34-90).
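The mask that separates vocabulary tokens from embedded regions can be pictured as a list of booleans aligned with the sequence. The sketch below is purely illustrative: `build_is_token_ids` is a hypothetical helper, and representing embeddings as plain lists of floats is an assumption made to keep the example free of tensor libraries.

```python
from typing import Union

Embedding = list[float]  # toy representation of one embedding vector


def build_is_token_ids(sequence: list[Union[int, Embedding]]) -> list[bool]:
    """True where a position holds a token id, False where it holds an embedding."""
    return [isinstance(item, int) for item in sequence]


# Two text tokens, two embedded positions, then one more text token.
mixed = [101, 2023, [0.1, 0.2], [0.3, 0.4], 102]
mask = build_is_token_ids(mixed)
```

Keeping the mask aligned position by position lets downstream code look up embeddings for token ids while passing pre-computed vectors through unchanged.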
Key Fields:
- prompt_is_token_ids: Aligned mask indicating whether a position is a vocabulary token or an embedding (vllm/v1/worker/gpu_input_batch.py:57-58).
- mm_features: List of multimodal feature specifications for the model runner (vllm/v1/worker/gpu_input_batch.py:37).

Sources:
- vllm/v1/worker/gpu_input_batch.py:34-181

| Function | Purpose | File |
|---|---|---|
| parse_chat_messages | Synchronous parsing of messages and multimodal data. | vllm/entrypoints/chat_utils.py:1008-1150 |
| parse_chat_messages_async | Asynchronous parsing, supporting remote media fetching. | vllm/entrypoints/chat_utils.py:1152-1200 |
| resolve_chat_template | Loads and resolves Jinja2 templates for the model. | vllm/entrypoints/chat_utils.py:26 |
Sources:
- vllm/entrypoints/chat_utils.py:1008-1200