This page documents the two-layer abstraction vLLM uses to manage model execution across hardware: the Executor layer and the Worker layer.
This page focuses on the v1 implementations under vllm/v1/executor/ and vllm/v1/worker/.
Two-layer model execution architecture
Sources: vllm/v1/executor/abstract.py37-92 vllm/v1/executor/uniproc_executor.py46-66 vllm/v1/executor/multiproc_executor.py102-187 vllm/v1/executor/ray_executor.py64-98
Executor (vllm/v1/executor/abstract.py37-177) is the abstract base class for all executor implementations. It is instantiated by the engine core and provides a uniform interface regardless of how many GPUs or processes are in use.
Executor.get_class(vllm_config) reads parallel_config.distributed_executor_backend to pick the concrete class:
| distributed_executor_backend | Executor Class | File |
|---|---|---|
| "mp" | MultiprocExecutor | vllm/v1/executor/multiproc_executor.py |
| "ray" | RayDistributedExecutor | vllm/v1/executor/ray_executor.py |
| "uni" | UniProcExecutor | vllm/v1/executor/uniproc_executor.py |
| "external_launcher" | ExecutorWithExternalLauncher | vllm/v1/executor/uniproc_executor.py |
Sources: vllm/v1/executor/abstract.py47-92
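The selection in the table above can be pictured as a lookup from backend string to class. The sketch below is illustrative only: the four classes are empty stand-ins for the real ones, and `pick_executor_class` is a hypothetical helper, not vLLM's `Executor.get_class` itself, which may handle additional cases not shown here.

```python
# Empty stand-ins; the real classes live under vllm/v1/executor/.
class MultiprocExecutor: ...
class RayDistributedExecutor: ...
class UniProcExecutor: ...
class ExecutorWithExternalLauncher(UniProcExecutor): ...

# Mapping taken from the table above.
_BACKEND_TO_CLASS = {
    "mp": MultiprocExecutor,
    "ray": RayDistributedExecutor,
    "uni": UniProcExecutor,
    "external_launcher": ExecutorWithExternalLauncher,
}


def pick_executor_class(distributed_executor_backend: str) -> type:
    """Hypothetical helper: return the executor class for a backend string."""
    try:
        return _BACKEND_TO_CLASS[distributed_executor_backend]
    except KeyError:
        raise ValueError(
            f"Unknown distributed_executor_backend: {distributed_executor_backend!r}"
        ) from None
```

Keeping the mapping in one place means the engine core only ever deals with the abstract interface, whichever concrete class is returned.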
collective_rpc Interface
All executor implementations expose collective_rpc(), the core RPC primitive. It sends a method call (by name or serialized callable) with positional and keyword arguments to all workers, then collects a list of results.
Higher-level methods (execute_model, sample_tokens, initialize_from_config, etc.) are implemented in the base class by delegating to collective_rpc. Subclasses override them only when they need extra behavior (e.g., handling pipeline parallelism).
Sources: vllm/v1/executor/abstract.py153-177 vllm/v1/executor/abstract.py118-133
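A minimal in-process sketch of the collective_rpc pattern follows. `ToyWorker`, `ToyExecutor`, and `run_on_worker` are hypothetical names invented for illustration. The sketch calls workers directly and skips serialization, so it shows only the dispatch-and-collect shape, not vLLM's actual transport.

```python
from typing import Any, Callable, Optional, Union


class ToyWorker:
    """Hypothetical worker with two callable methods."""

    def __init__(self, rank: int) -> None:
        self.rank = rank

    def get_rank(self) -> int:
        return self.rank

    def scale(self, x: int, factor: int = 1) -> int:
        return x * factor * (self.rank + 1)


def run_on_worker(worker: Any, method: Union[str, Callable], args: tuple, kwargs: dict) -> Any:
    # A string is looked up as a method name; a callable receives the worker first.
    if isinstance(method, str):
        return getattr(worker, method)(*args, **kwargs)
    return method(worker, *args, **kwargs)


class ToyExecutor:
    def __init__(self, num_workers: int) -> None:
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def collective_rpc(
        self,
        method: Union[str, Callable],
        args: tuple = (),
        kwargs: Optional[dict] = None,
    ) -> list:
        # Send the same call to every worker and collect one result per worker.
        kwargs = kwargs or {}
        return [run_on_worker(w, method, args, kwargs) for w in self.workers]

    def get_ranks(self) -> list:
        # A higher-level method implemented purely by delegation.
        return self.collective_rpc("get_rank")


executor = ToyExecutor(num_workers=2)
by_name = executor.collective_rpc("scale", args=(10,), kwargs={"factor": 2})
by_callable = executor.collective_rpc(lambda w: w.rank * 10)
```

Because `get_ranks` is written only in terms of `collective_rpc`, a subclass that swaps the transport would inherit it unchanged, which is the property the base class relies on.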
Sources: vllm/v1/executor/uniproc_executor.py45-147
UniProcExecutor runs a single worker in the same process as the engine core. It is the default for single-GPU deployments.
- _init_executor() creates one WorkerWrapperBase(rpc_rank=0), then calls init_worker(), init_device(), and load_model() in-process vllm/v1/executor/uniproc_executor.py46-65
- collective_rpc() calls run_method(self.driver_worker, method, args, kwargs) directly vllm/v1/executor/uniproc_executor.py80-106
- AsyncOutputFuture is used when non_block=True is passed to execution methods vllm/v1/executor/uniproc_executor.py26-43
Sources: vllm/v1/executor/uniproc_executor.py149-186
ExecutorWithExternalLauncher is a subclass of UniProcExecutor designed for torchrun-compatible launchers. Instead of one executor managing multiple workers, the user launches one engine per GPU. distributed_init_method is set to "env://", and rank and local_rank are read from the RANK and LOCAL_RANK environment variables vllm/v1/executor/uniproc_executor.py174-185
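As a rough illustration of the environment-driven setup described above, the sketch below reads the two variables that torchrun-style launchers export for each process. `read_launcher_env` is a hypothetical helper, not a vLLM function, and the returned dictionary layout is an assumption made for the example.

```python
import os
from typing import Mapping, Optional


def read_launcher_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: derive distributed settings from launcher variables."""
    env = os.environ if environ is None else environ
    return {
        # "env://" tells torch.distributed to rendezvous via environment variables.
        "distributed_init_method": "env://",
        "rank": int(env["RANK"]),
        "local_rank": int(env["LOCAL_RANK"]),
    }


# Example with an explicit mapping instead of the real process environment.
settings = read_launcher_env({"RANK": "3", "LOCAL_RANK": "1"})
```

Passing the mapping explicitly keeps the helper testable without mutating `os.environ`.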
Sources: vllm/v1/executor/multiproc_executor.py102-466
MultiprocExecutor spawns one worker subprocess per local GPU. It is selected when distributed_executor_backend="mp".
MultiprocExecutor initialization sequence
Sources: vllm/v1/executor/multiproc_executor.py109-225 vllm/v1/executor/multiproc_executor.py603-643 vllm/v1/executor/multiproc_executor.py712-810
Each worker subprocess has communication channels (backed by shared memory) managed via MessageQueue vllm/v1/executor/multiproc_executor.py130-156
| Queue | Direction | Purpose |
|---|---|---|
| rpc_broadcast_mq | Executor → all workers | Broadcast (method, args, kwargs, output_rank) |
| worker_response_mq | Worker → Executor | Return (status, result) for each call |
The executor's collective_rpc() enqueues a call tuple on rpc_broadcast_mq, then dequeues from the appropriate response queues vllm/v1/executor/multiproc_executor.py317-389
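The broadcast-then-collect flow can be imitated with standard-library queues and threads. This is a toy model under several assumptions: threads stand in for subprocesses, one `queue.Queue` per worker emulates the broadcast queue, the status strings `"SUCCESS"` and `"FAILURE"` are placeholders, and the rule that only the worker matching `output_rank` replies is assumed for the example. `ToyMultiprocExecutor` and `worker_loop` are hypothetical names.

```python
import queue
import threading


def worker_loop(rank: int, rpc_q: queue.Queue, response_q: queue.Queue) -> None:
    """Handle broadcast calls until a None sentinel arrives."""
    handlers = {"add": lambda a, b: a + b + rank}
    while True:
        msg = rpc_q.get()
        if msg is None:
            break
        method, args, kwargs, output_rank = msg
        try:
            status, result = "SUCCESS", handlers[method](*args, **kwargs)
        except Exception as exc:  # report the failure instead of crashing the loop
            status, result = "FAILURE", repr(exc)
        # Assumption: reply only if every rank is expected or this rank is selected.
        if output_rank is None or output_rank == rank:
            response_q.put((status, result))


class ToyMultiprocExecutor:
    def __init__(self, num_workers: int = 2) -> None:
        self.rpc_qs = [queue.Queue() for _ in range(num_workers)]
        self.response_qs = [queue.Queue() for _ in range(num_workers)]
        self.threads = [
            threading.Thread(
                target=worker_loop,
                args=(rank, self.rpc_qs[rank], self.response_qs[rank]),
                daemon=True,
            )
            for rank in range(num_workers)
        ]
        for thread in self.threads:
            thread.start()

    def collective_rpc(self, method, args=(), kwargs=None, output_rank=None) -> list:
        call = (method, args, kwargs or {}, output_rank)
        for q in self.rpc_qs:  # emulate a broadcast: every worker sees the same tuple
            q.put(call)
        ranks = range(len(self.response_qs)) if output_rank is None else [output_rank]
        results = []
        for rank in ranks:
            status, result = self.response_qs[rank].get(timeout=5)
            if status != "SUCCESS":
                raise RuntimeError(f"worker {rank} failed: {result}")
            results.append(result)
        return results

    def shutdown(self) -> None:
        for q in self.rpc_qs:
            q.put(None)
        for thread in self.threads:
            thread.join(timeout=5)


toy = ToyMultiprocExecutor(num_workers=2)
all_results = toy.collective_rpc("add", args=(1, 2))
one_result = toy.collective_rpc("add", args=(1, 2), output_rank=1)
toy.shutdown()
```

Returning a status alongside the result lets the executor surface a worker-side exception in the calling process rather than hanging on a reply that never comes.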
Sources: vllm/v1/executor/ray_executor.py64-643
RayDistributedExecutor distributes workers as Ray actors and uses Ray's Compiled DAG for the execution path.
_init_workers_ray() vllm/v1/executor/ray_executor.py151-398:
- bundle_indices from the Ray placement group.
- RayWorkerWrapper remote actors vllm/v1/executor/ray_executor.py215-225
- collective_rpc("adjust_rank", ...) vllm/v1/executor/ray_executor.py255-257
- CUDA_VISIBLE_DEVICES environment variables vllm/v1/executor/ray_executor.py270-276
- collective_rpc("init_worker", ...), "init_device", "load_model" on all actors vllm/v1/executor/ray_executor.py304-320
On the first execute_model() call, _compiled_ray_dag() builds a Ray CompiledDAG vllm/v1/executor/ray_executor.py542-635. It chains PP stages so intermediate tensors flow from PP rank 0 to PP rank N−1.
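The chaining of pipeline-parallel stages can be illustrated without Ray. In the sketch below each stage is a plain callable and `chain_pp_stages` is a hypothetical helper. It shows only the data flow from the first stage to the last, not the Compiled DAG API or how real intermediate tensors are transported.

```python
from typing import Any, Callable, Sequence


def chain_pp_stages(stages: Sequence[Callable[[Any], Any]]) -> Callable[[Any], Any]:
    """Hypothetical helper: compose stages so each consumes the previous output."""

    def run(model_input: Any) -> Any:
        intermediate = model_input
        for stage in stages:  # PP rank 0 first, PP rank N-1 last
            intermediate = stage(intermediate)
        return intermediate

    return run


# Three toy stages standing in for PP ranks 0, 1, and 2.
pipeline = chain_pp_stages([
    lambda x: x + 1,
    lambda x: x * 2,
    lambda x: x - 3,
])
final_output = pipeline(5)  # ((5 + 1) * 2) - 3
```

Building the chain once and reusing it on every call mirrors the reason for constructing the DAG on the first execute_model() call rather than on each step.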
WorkerBase
Sources: vllm/v1/worker/worker_base.py38-177
WorkerBase defines the interface every worker implementation must fulfill. It stores the decomposed VllmConfig fields and rank information.
WorkerWrapperBase
Sources: vllm/v1/worker/worker_base.py179-372
WorkerWrapperBase sits between an executor and a WorkerBase instance. Its roles are:
- Instantiates the WorkerBase subclass only when init_worker() is called vllm/v1/worker/worker_base.py251-314
- Resolves parallel_config.worker_cls and instantiates it via resolve_obj_by_qualname vllm/v1/worker/worker_base.py270-272
- execute_method(method, *args, **kwargs) calls run_method(self, ...) vllm/v1/worker/worker_base.py343-345
Worker
Sources: vllm/v1/worker/gpu_worker.py117-900
Worker is the concrete WorkerBase implementation for GPU execution. It coordinates between the distributed environment and the GPUModelRunner.
Code entities in the GPU worker layer
Sources: vllm/v1/worker/gpu_worker.py117-164 vllm/v1/worker/worker_base.py39-83 vllm/v1/worker/worker_base.py187-215
The Worker handles complex device state and memory profiling:
- init_device(): Configures torch.set_float32_matmul_precision vllm/v1/worker/gpu_worker.py135-137, initializes distributed groups via init_distributed_environment() vllm/v1/worker/gpu_worker.py245-247, and creates the GPUModelRunner vllm/v1/worker/gpu_worker.py289-291
- determine_available_memory(): Uses memory_profiling and model_runner.profile_run() to find the remaining bytes for KV cache vllm/v1/worker/gpu_worker.py353-502
- sleep() / wake_up(): Interfaces with CuMemAllocator (via get_mem_allocator_instance) to offload model weights or KV cache to CPU memory vllm/v1/worker/gpu_worker.py165-190
The GPU worker supports advanced execution modes like microbatching (ubatching) and SM (Streaming Multiprocessor) control to overlap communication and computation.
- UBatchWrapper: Wraps a runnable (like the model forward pass) to handle execution across multiple ubatches.
- SMControlContextManager: Controls SM allocation between communication (e.g., DeepEP for MoE) and computation (e.g., DeepGEMM). It sets the number of SMs for communication and computation upon entering the context.
The Worker manages performance profiling through self.profiler vllm/v1/worker/gpu_worker.py151-152. It supports both TorchProfilerWrapper and CudaProfilerWrapper based on profiler_config.profiler vllm/v1/worker/gpu_worker.py157-159. Profiling can be started and stopped via start_profile() and stop_profile(), which wrap the underlying profiler's lifecycle vllm/v1/worker/gpu_worker.py610-637.
Sources: vllm/v1/worker/gpu_worker.py146-152 vllm/v1/worker/gpu_worker.py610-637