Overview


This page provides a high-level introduction to vLLM's architecture, core components, and design principles. It serves as an entry point for understanding how vLLM orchestrates large language model inference across multiple hardware platforms with optimized memory management and execution.

vLLM V1 represents a significant re-architecture of the core engine (scheduler, KV cache manager, worker, and sampler) to provide a cohesive, modular, and high-performance framework while retaining the stable model implementations and kernels from V0 docs/usage/v1_guide.md9-15

What is vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving README.md24 It optimizes throughput and memory efficiency through several key technologies:

PagedAttention: Efficient management of attention key and value memory that eliminates fragmentation README.md31
Continuous Batching: Dynamic request scheduling and iteration-level batching README.md32
Chunked Prefill: Enabled by default in V1, it processes large prefills in smaller chunks to eliminate pipeline bubbles and balance compute docs/usage/v1_guide.md37-39 README.md32
Speculative Decoding: Support for n-gram, suffix, EAGLE, and DFlash to accelerate generation README.md37
Multi-platform Support: Support for NVIDIA GPUs, AMD GPUs, Intel GPUs, TPUs, and CPUs README.md51 docs/usage/v1_guide.md86-93

Sources: README.md24-62 docs/usage/v1_guide.md9-39
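To make the PagedAttention idea concrete, the following is a minimal sketch of block-based KV memory bookkeeping. It is not vLLM's implementation: `ToyBlockPool`, `ToySequence`, and `BLOCK_SIZE` are hypothetical names introduced here, and the sketch only tracks which fixed-size blocks a sequence owns. The point it illustrates is that a sequence grows one block at a time and its blocks need not be contiguous, so memory is not reserved up front for the maximum length.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value, chosen for readability)


@dataclass
class ToyBlockPool:
    """Hypothetical fixed-size pool of KV cache block ids."""
    num_blocks: int
    free: list[int] = field(init=False, default_factory=list)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, block_ids: list[int]) -> None:
        self.free.extend(block_ids)


@dataclass
class ToySequence:
    """Maps a sequence's logical token positions onto physical blocks."""
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_tokens(self, n: int, pool: ToyBlockPool) -> None:
        total = self.num_tokens + n
        blocks_needed = -(-total // BLOCK_SIZE)  # ceiling division
        extra = blocks_needed - len(self.block_table)
        if extra > len(pool.free):
            raise MemoryError("not enough free KV blocks")
        for _ in range(extra):
            self.block_table.append(pool.allocate())
        self.num_tokens = total


pool = ToyBlockPool(num_blocks=8)
seq_a, seq_b = ToySequence(), ToySequence()
seq_a.append_tokens(5, pool)  # 5 tokens need 2 blocks
seq_b.append_tokens(3, pool)  # 3 tokens need 1 block
seq_a.append_tokens(4, pool)  # 9 tokens need 3 blocks; the third is not adjacent
```

In this toy, `seq_a` ends up with a block table whose entries are interleaved with `seq_b`'s block, which is the property that lets a paged layout avoid fragmentation from per-request contiguous reservations.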

System Architecture

vLLM follows a layered architecture with a clear separation of concerns. The V1 engine introduces a decoupled execution model where the EngineCore runs in a separate process from the API frontend to minimize CPU overhead and Python GIL contention docs/usage/v1_guide.md19-20

High-Level System Components

Diagram: "vLLM System Architecture (Natural Language to Code Entities)"

Layered Architecture Overview

Layer	Purpose	Key Components
External Interface	Entry points for users	`LLM`, `AsyncLLM`, OpenAI API server, `ServeSubcommand` vllm/entrypoints/llm.py66 vllm/v1/engine/async_llm.py70 vllm/entrypoints/cli/serve.py44
Configuration	Argument parsing and config assembly	`EngineArgs`, `VllmConfig`, `ModelConfig`, `ParallelConfig` vllm/v1/engine/core.py24 vllm/entrypoints/llm.py14-30
Engine Orchestration	Request lifecycle and IPC coordination	`EngineCore`, `InputProcessor`, `OutputProcessor` vllm/v1/engine/core.py94 vllm/v1/engine/input_processor.py36 vllm/v1/engine/async_llm.py135
Scheduling & Memory	Resource allocation and KV management	`Scheduler`, `KVCacheConfig` vllm/v1/core/sched/interface.py53 vllm/v1/kv_cache_interface.py78
Execution	Model forward passes on hardware	`Executor`, `WorkerBase` vllm/v1/executor/__init__.py10 vllm/v1/worker/worker_base.py40

Sources: vllm/v1/engine/core.py94-156 vllm/v1/engine/async_llm.py70-153 vllm/entrypoints/llm.py66-162 vllm/entrypoints/cli/serve.py44-149

Core Components

EngineCore: The Central Serving Loop

EngineCore is the inner loop of vLLM. It owns the Scheduler and the Executor and drives them on each iteration vllm/v1/engine/core.py94-121 At startup it also applies the configuration and hardware-specific optimizations that these components depend on.

Key responsibilities:

Initialization: Profiles GPU memory, initializes KV caches, and sets up the model executor and scheduler vllm/v1/engine/core.py121-156
Request Handling: Receives EngineCoreRequest objects (Add, Abort, LoRA commands) vllm/v1/engine/core.py63-65
Iteration Loop: Orchestrates the model forward pass and output collection vllm/v1/engine/core.py84
Multimodal Support: Integrates multimodal registries to handle inputs like images and audio vllm/v1/engine/core.py33 vllm/v1/engine/core.py166-169
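The responsibilities above can be sketched as a small schedule, execute, collect loop. This is a toy under stated assumptions, not the EngineCore class: `ToyEngineCore`, `ToyRequest`, and `_fake_forward` are hypothetical names, the "model" just emits a counter per request, and initialization, LoRA commands, and multimodal handling are left out.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    request_id: str
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)


class ToyEngineCore:
    """Hypothetical stand-in for an engine core loop (not the vLLM class)."""

    def __init__(self, max_batch_size: int = 2) -> None:
        self.max_batch_size = max_batch_size
        self.waiting: deque[ToyRequest] = deque()
        self.running: list[ToyRequest] = []

    def add_request(self, req: ToyRequest) -> None:
        self.waiting.append(req)

    def abort_request(self, request_id: str) -> None:
        self.waiting = deque(r for r in self.waiting if r.request_id != request_id)
        self.running = [r for r in self.running if r.request_id != request_id]

    def _fake_forward(self, batch: list[ToyRequest]) -> dict[str, int]:
        # Placeholder for the executor: one fake token id per running request.
        return {r.request_id: len(r.generated) for r in batch}

    def step(self) -> dict[str, int]:
        # 1. Schedule: admit waiting requests while the batch has room.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # 2. Execute: run the (fake) forward pass on the current batch.
        new_tokens = self._fake_forward(self.running)
        # 3. Collect: record outputs and retire finished requests.
        for r in self.running:
            r.generated.append(new_tokens[r.request_id])
        self.running = [r for r in self.running
                        if len(r.generated) < r.max_new_tokens]
        return new_tokens


core = ToyEngineCore(max_batch_size=2)
core.add_request(ToyRequest("a", max_new_tokens=2))
core.add_request(ToyRequest("b", max_new_tokens=1))
core.add_request(ToyRequest("c", max_new_tokens=1))
first = core.step()      # batch holds "a" and "b"; "c" waits
core.abort_request("c")  # abort before "c" is ever scheduled
second = core.step()     # only "a" is still running
```

Because admission happens at the top of every `step()`, a slot freed by a finished request is available to a waiting request on the very next iteration, which is the iteration-level batching behavior described earlier.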

EngineCoreClient and IPC

EngineCoreClient abstracts the communication between the frontend (API) and the backend (EngineCore) vllm/v1/engine/core_client.py71:

InprocClient: Runs EngineCore in the same process as the caller vllm/v1/engine/core_client.py105
SyncMPClient: Runs EngineCore in a background process and communicates with it over ZMQ, for synchronous use cases vllm/v1/engine/core_client.py103
AsyncMPClient: Uses the same ZMQ and background-process setup, with asyncio support for AsyncLLM vllm/v1/engine/core_client.py132
DPLBAsyncMPClient: Implements internal load balancing across multiple Data Parallel (DP) engine ranks vllm/v1/engine/core_client.py131

Sources: vllm/v1/engine/core_client.py71-132 vllm/v1/engine/core.py94-186
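The value of this abstraction is that the frontend calls the same interface whether the core runs in-process or behind a transport. The sketch below illustrates that shape only. All names (`ToyCore`, `ToyCoreClient`, `ToyInprocClient`, `ToyBackgroundClient`) are hypothetical, and a thread with `queue.Queue` stands in for the background process and ZMQ sockets, so it does not reproduce the GIL isolation that a separate process provides.

```python
import queue
import threading
from abc import ABC, abstractmethod


class ToyCore:
    """Hypothetical core: returns the whitespace token count of a prompt."""

    def handle(self, prompt: str) -> int:
        return len(prompt.split())


class ToyCoreClient(ABC):
    """Common interface the frontend codes against."""

    @abstractmethod
    def call(self, prompt: str) -> int: ...

    def shutdown(self) -> None:
        pass


class ToyInprocClient(ToyCoreClient):
    """Core lives in the caller's thread; calls are plain method calls."""

    def __init__(self) -> None:
        self._core = ToyCore()

    def call(self, prompt: str) -> int:
        return self._core.handle(prompt)


class ToyBackgroundClient(ToyCoreClient):
    """Core lives behind request/response queues (assumed stand-in for ZMQ)."""

    def __init__(self) -> None:
        self._requests: queue.Queue = queue.Queue()
        self._responses: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self) -> None:
        core = ToyCore()
        while True:
            prompt = self._requests.get()
            if prompt is None:  # sentinel: stop serving
                break
            self._responses.put(core.handle(prompt))

    def call(self, prompt: str) -> int:
        self._requests.put(prompt)
        return self._responses.get(timeout=5)

    def shutdown(self) -> None:
        self._requests.put(None)
        self._worker.join(timeout=5)


clients = [ToyInprocClient(), ToyBackgroundClient()]
results = [c.call("hello paged attention") for c in clients]
for c in clients:
    c.shutdown()
```

Both clients return the same result for the same input; only the transport differs, which is what lets the frontend choose a client per deployment mode without changing its own code.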

Request Processing Flow

The flow below demonstrates how a user request traverses the system from a high-level API call to GPU execution.

Diagram: "vLLM V1 Request Lifecycle (Code Entity Space)"

Request Lifecycle Stages:

Input Processing: InputProcessor validates SamplingParams and converts EngineInput into EngineCoreRequest vllm/v1/engine/async_llm.py135 vllm/v1/engine/input_processor.py36-82
Scheduling: Scheduler determines which requests enter the current batch based on token budgets and cache availability vllm/v1/engine/core.py150-158
Execution: Executor coordinates workers to run the model forward pass vllm/v1/engine/core.py123
Output Processing: OutputProcessor collects EngineCoreOutputs, updates request states, and handles detokenization vllm/v1/engine/async_llm.py138-143

Sources: vllm/v1/engine/core.py94-186 vllm/v1/engine/async_llm.py132-153
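The scheduling stage can be illustrated with a token-budget sketch that also shows chunked prefill. This is a simplified model, not vLLM's Scheduler: `ToySchedRequest` and `schedule_step` are hypothetical names, requests are served in list order, and KV cache availability is ignored. A request still in prefill takes as much of its remaining prompt as the budget allows, while a request that is already decoding needs one token per step.

```python
from dataclasses import dataclass


@dataclass
class ToySchedRequest:
    request_id: str
    prompt_len: int
    num_computed: int = 0  # tokens already processed for this request

    @property
    def remaining_prefill(self) -> int:
        return max(self.prompt_len - self.num_computed, 0)


def schedule_step(requests: list[ToySchedRequest],
                  token_budget: int) -> dict[str, int]:
    """Return how many tokens each request may process in this step."""
    plan: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        # Prefill wants the rest of the prompt; decode wants exactly one token.
        wanted = req.remaining_prefill or 1
        granted = min(wanted, token_budget)
        plan[req.request_id] = granted
        req.num_computed += granted
        token_budget -= granted
    return plan


reqs = [
    ToySchedRequest("decode", prompt_len=8, num_computed=8),  # already decoding
    ToySchedRequest("long", prompt_len=20),                   # fresh 20-token prompt
]
step1 = schedule_step(reqs, token_budget=8)
step2 = schedule_step(reqs, token_budget=8)
step3 = schedule_step(reqs, token_budget=8)
```

With a budget of 8 tokens per step, the 20-token prompt is split across three steps while the decoding request keeps receiving its single token each time, so the long prefill never stalls generation for the request already in flight.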

Optimized Model Execution: Fused MoE

vLLM uses fused kernels for Mixture-of-Experts (MoE) models. The FusedMoE layer combines the gate, up, and down projections into a single operation where possible.

Diagram: "MoE Execution Pipeline (Code Entity Space)"

MoE Design Principles:

Modular Kernels: The architecture decouples the Router from Expert execution steps using MoEBackend configurations vllm/config/kernel.py75
Quantization Support: Native support for FP8, MXFP4, and NVFP4 through specialized backends vllm/config/vllm.py127-128
Hardware Acceleration: Integration with FlashInfer for specialized all-reduce fusion on Hopper/Blackwell GPUs vllm/config/vllm.py148-157

Sources: vllm/config/kernel.py75 vllm/config/vllm.py127-157
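As a reference for what a fused kernel has to compute, here is an unfused, pure-Python sketch of a single-token MoE forward pass. It is not the FusedMoE layer. The function names are hypothetical, and two details are assumptions made for illustration: a SiLU-gated expert MLP, and a softmax taken over only the selected top-k router logits. Real models differ in activation and in how routing weights are normalized.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def matvec(w: list[list[float]], x: list[float]) -> list[float]:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def expert_mlp(x, w_gate, w_up, w_down):
    """Gate, up, and down projections as three separate steps."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def route_top_k(x, w_router, k):
    """Pick the k highest-scoring experts and normalize their weights."""
    logits = matvec(w_router, x)
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # stable softmax
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, w_router, experts, k=2):
    """Weighted sum of the selected experts' outputs."""
    out = [0.0] * len(x)
    for idx, weight in route_top_k(x, w_router, k):
        y = expert_mlp(x, *experts[idx])
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [(identity, identity, identity)] * 3      # three identical toy experts
w_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # router logits: [1, 2, -3]
x = [1.0, 2.0]

routes = route_top_k(x, w_router, k=2)
out = moe_forward(x, w_router, experts, k=2)
```

The router and the expert computation are separate functions here, which mirrors the decoupling of routing from expert execution described above. A fused kernel would aim to produce the same result as `moe_forward` without materializing the intermediate `gate`, `up`, and `hidden` vectors separately.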

Optimization and Serving Modes

vLLM V1 supports several serving and scaling modes:

Decoupled Serving: The engine can run in a "headless" mode where the API server and EngineCore are separate processes, coordinated by CoreEngineProcManager vllm/v1/engine/utils.py121-148
Distributed Coordination: Supports Data Parallelism (DP) with various load balancing strategies, including internal and external load balancers vllm/config/parallel.py146-159
Elastic Serving: Support for scaling Data Parallel size dynamically via scale_elastic_ep vllm/v1/engine/core_client.py209
Structured Outputs: Integrated StructuredOutputManager and grammar-based generation support vllm/v1/engine/core.py134

Sources: vllm/v1/engine/utils.py121-148 vllm/config/parallel.py146-159 vllm/v1/engine/core_client.py209 vllm/v1/engine/core.py134
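To illustrate what load balancing across Data Parallel engine ranks involves, the sketch below dispatches each request to the rank with the fewest in-flight requests. The policy is an assumption chosen for illustration, since this page does not specify which strategy vLLM applies, and `ToyEngineRank`, `pick_rank`, and `dispatch` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ToyEngineRank:
    rank: int
    in_flight: int = 0  # requests currently assigned to this rank


def pick_rank(ranks: list[ToyEngineRank]) -> ToyEngineRank:
    # Least-loaded first; ties go to the lowest rank id.
    return min(ranks, key=lambda r: (r.in_flight, r.rank))


def dispatch(ranks: list[ToyEngineRank],
             request_ids: list[str]) -> dict[str, int]:
    """Assign each request to a rank and update the load counters."""
    assignment: dict[str, int] = {}
    for rid in request_ids:
        target = pick_rank(ranks)
        target.in_flight += 1
        assignment[rid] = target.rank
    return assignment


ranks = [
    ToyEngineRank(0, in_flight=2),
    ToyEngineRank(1),
    ToyEngineRank(2, in_flight=1),
]
assignment = dispatch(ranks, ["r1", "r2", "r3", "r4"])
```

In this toy, an internal load balancer corresponds to the frontend calling `dispatch` itself, whereas an external one would make the same decision outside the serving process and send each request directly to a chosen rank.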