Worker and Executor Architecture

Purpose and Scope

This page documents the two-layer abstraction vLLM uses to manage model execution across hardware: the Executor layer and the Worker layer.

The Executor is owned by the engine core and acts as a proxy to one or more worker processes or actors. It handles process lifecycle, RPC dispatch, and result collection.
The Worker runs inside each process (or Ray actor) and is responsible for device initialization, model loading, KV cache allocation, and executing individual forward passes.

This page focuses on the v1 implementations under vllm/v1/executor/ and vllm/v1/worker/.

Architecture Overview

Two-layer model execution architecture

Sources: vllm/v1/executor/abstract.py:37-92, vllm/v1/executor/uniproc_executor.py:46-66, vllm/v1/executor/multiproc_executor.py:102-187, vllm/v1/executor/ray_executor.py:64-98

The Executor Abstraction

Executor (vllm/v1/executor/abstract.py:37-177) is the abstract base class for all executor implementations. It is instantiated by the engine core and provides a uniform interface regardless of how many GPUs or processes are in use.

Backend Selection

Executor.get_class(vllm_config) reads parallel_config.distributed_executor_backend to pick the concrete class:

`distributed_executor_backend`	Executor Class	File
`"mp"`	`MultiprocExecutor`	`vllm/v1/executor/multiproc_executor.py`
`"ray"`	`RayDistributedExecutor`	`vllm/v1/executor/ray_executor.py`
`"uni"`	`UniProcExecutor`	`vllm/v1/executor/uniproc_executor.py`
`"external_launcher"`	`ExecutorWithExternalLauncher`	`vllm/v1/executor/uniproc_executor.py`

Sources: vllm/v1/executor/abstract.py:47-92
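
The mapping in the table above can be pictured as a simple lookup keyed on the backend string. The following is a minimal sketch, not the actual body of Executor.get_class: the stand-in classes, the BACKEND_TO_EXECUTOR table, the ToyParallelConfig dataclass, and the get_executor_class helper are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


# Stand-in classes; the real ones live under vllm/v1/executor/.
class MultiprocExecutor: ...
class RayDistributedExecutor: ...
class UniProcExecutor: ...
class ExecutorWithExternalLauncher(UniProcExecutor): ...


# Hypothetical lookup table mirroring the backend table above.
BACKEND_TO_EXECUTOR = {
    "mp": MultiprocExecutor,
    "ray": RayDistributedExecutor,
    "uni": UniProcExecutor,
    "external_launcher": ExecutorWithExternalLauncher,
}


@dataclass
class ToyParallelConfig:
    """Hypothetical stand-in for the parallel_config object."""
    distributed_executor_backend: str = "uni"


def get_executor_class(parallel_config):
    """Return the executor class registered for the configured backend."""
    backend = parallel_config.distributed_executor_backend
    try:
        return BACKEND_TO_EXECUTOR[backend]
    except KeyError:
        raise ValueError(
            f"Unknown distributed_executor_backend: {backend!r}"
        ) from None


selected = get_executor_class(ToyParallelConfig("mp"))
default = get_executor_class(ToyParallelConfig())
```

The sketch only covers string backends; how the real method treats other values is not shown here.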

The `collective_rpc` Interface

All executor implementations expose collective_rpc(), the core RPC primitive. It sends a method call (by name or serialized callable) with positional and keyword arguments to all workers, then collects a list of results.

Higher-level methods (execute_model, sample_tokens, initialize_from_config, etc.) are implemented in the base class by delegating to collective_rpc. Subclasses override them only when they need extra behavior (e.g., handling pipeline parallelism).

Sources: vllm/v1/executor/abstract.py:153-177, vllm/v1/executor/abstract.py:118-133
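
To make the delegation pattern concrete, here is a minimal in-process sketch. ToyWorker and ToyExecutor are hypothetical, the workers are plain objects rather than processes or actors, and the toy execute_model simply returns the whole result list, which may differ from what the real base class returns.

```python
class ToyWorker:
    """Hypothetical worker that only records its rank."""

    def __init__(self, rank):
        self.rank = rank

    def execute_model(self, scheduler_output):
        return f"rank{self.rank}:{scheduler_output}"


class ToyExecutor:
    """Hypothetical executor whose workers live in the same process."""

    def __init__(self, num_workers):
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    def collective_rpc(self, method, args=(), kwargs=None):
        """Call `method` on every worker and collect one result per worker."""
        kwargs = kwargs or {}
        results = []
        for worker in self.workers:
            if isinstance(method, str):
                # Method given by name: look it up on the worker.
                results.append(getattr(worker, method)(*args, **kwargs))
            else:
                # Method given as a callable: pass the worker as first argument.
                results.append(method(worker, *args, **kwargs))
        return results

    def execute_model(self, scheduler_output):
        # Higher-level method expressed purely in terms of collective_rpc.
        return self.collective_rpc("execute_model", args=(scheduler_output,))


executor = ToyExecutor(num_workers=2)
by_name = executor.execute_model("step0")
by_callable = executor.collective_rpc(lambda worker: worker.rank)
```

Because every higher-level method funnels through one primitive, a subclass in this style only has to change how collective_rpc reaches its workers.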

UniProcExecutor

vllm/v1/executor/uniproc_executor.py:45-147

UniProcExecutor runs a single worker in the same process as the engine core. It is the default for single-GPU deployments.

_init_executor() creates one WorkerWrapperBase(rpc_rank=0), then calls init_worker(), init_device(), and load_model() in-process (vllm/v1/executor/uniproc_executor.py:46-65).
collective_rpc() calls run_method(self.driver_worker, method, args, kwargs) directly (vllm/v1/executor/uniproc_executor.py:80-106).
Asynchronous scheduling is supported: execution methods return an AsyncOutputFuture when non_block=True is passed (vllm/v1/executor/uniproc_executor.py:26-43).
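
The non_block behavior can be sketched with the standard library. In this hypothetical example, concurrent.futures.Future stands in for AsyncOutputFuture, and the future is already completed when it is returned, which is a simplification; ToyWorker and ToyUniProcExecutor are illustrative names only.

```python
from concurrent.futures import Future


class ToyWorker:
    """Hypothetical in-process worker."""

    def add(self, a, b):
        return a + b


class ToyUniProcExecutor:
    """Hypothetical single-worker executor running in the caller's process."""

    def __init__(self):
        self.driver_worker = ToyWorker()

    def collective_rpc(self, method, args=(), kwargs=None, non_block=False):
        kwargs = kwargs or {}
        if not non_block:
            # Blocking path: call the worker directly, wrap result in a list.
            return [getattr(self.driver_worker, method)(*args, **kwargs)]
        # Non-blocking path: hand back a future carrying the result list.
        future = Future()
        try:
            result = getattr(self.driver_worker, method)(*args, **kwargs)
            future.set_result([result])
        except Exception as exc:
            future.set_exception(exc)
        return future


executor = ToyUniProcExecutor()
blocking = executor.collective_rpc("add", args=(1, 2))
pending = executor.collective_rpc("add", args=(1, 2), non_block=True)
```

Callers that pass non_block=True retrieve the value with pending.result(), so the calling convention stays the same whether or not the work has actually finished.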

ExecutorWithExternalLauncher

vllm/v1/executor/uniproc_executor.py:149-186

ExecutorWithExternalLauncher is a subclass of UniProcExecutor designed for torchrun-compatible launchers. Instead of one executor managing multiple workers, the user launches one engine per GPU. distributed_init_method is set to "env://", and rank and local_rank are read from the RANK and LOCAL_RANK environment variables (vllm/v1/executor/uniproc_executor.py:174-185).
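
As a rough illustration of that environment-driven setup, the sketch below reads the two variables and pairs them with the "env://" init method. The read_launcher_env helper and its returned dictionary are hypothetical; only the variable names and the "env://" value come from the description above.

```python
import os


def read_launcher_env(environ=None):
    """Collect distributed settings the way an external launcher provides them.

    `environ` defaults to os.environ; passing a dict makes the helper testable.
    """
    environ = os.environ if environ is None else environ
    return {
        "distributed_init_method": "env://",
        "rank": int(environ["RANK"]),
        "local_rank": int(environ["LOCAL_RANK"]),
    }


# Simulate what a torchrun-style launcher would export for one process.
settings = read_launcher_env({"RANK": "3", "LOCAL_RANK": "1"})
```

A missing variable raises KeyError in this sketch, which surfaces a misconfigured launch early rather than silently defaulting to rank 0.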

MultiprocExecutor

vllm/v1/executor/multiproc_executor.py:102-466

MultiprocExecutor spawns one worker subprocess per local GPU. It is selected when distributed_executor_backend="mp".

Worker Process Spawning

MultiprocExecutor initialization sequence

Sources: vllm/v1/executor/multiproc_executor.py:109-225, vllm/v1/executor/multiproc_executor.py:603-643, vllm/v1/executor/multiproc_executor.py:712-810

Message Queue Communication

The executor and its worker subprocesses communicate over shared-memory channels managed via MessageQueue (vllm/v1/executor/multiproc_executor.py:130-156):

Queue	Direction	Purpose
`rpc_broadcast_mq`	Executor → all workers	Broadcast `(method, args, kwargs, output_rank)`
`worker_response_mq`	Worker → Executor	Return `(status, result)` for each call

The executor's collective_rpc() enqueues a call tuple on rpc_broadcast_mq, then dequeues from the appropriate response queues (vllm/v1/executor/multiproc_executor.py:317-389).
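
The enqueue-then-dequeue flow can be imitated with threads and standard-library queues. This is a sketch under several assumptions: threads replace subprocesses, one queue.Queue per worker stands in for the single shared-memory broadcast queue, the status strings are invented, and output_rank is assumed to mean that only that rank replies. METHODS, worker_loop, and ToyMultiprocExecutor are hypothetical names.

```python
import queue
import threading

# Hypothetical method table available inside every worker.
METHODS = {"scale_rank": lambda rank, factor: rank * factor}


def worker_loop(rank, rpc_q, response_q):
    """Busy loop: read (method, args, kwargs, output_rank), run it, reply."""
    while True:
        method, args, kwargs, output_rank = rpc_q.get()
        if method is None:  # shutdown sentinel used by this sketch
            break
        try:
            status, result = "SUCCESS", METHODS[method](rank, *args, **kwargs)
        except Exception as exc:
            status, result = "FAILURE", repr(exc)
        # Reply only if every rank should answer or this rank was selected.
        if output_rank is None or output_rank == rank:
            response_q.put((status, result))


class ToyMultiprocExecutor:
    def __init__(self, world_size):
        self.rpc_queues = [queue.Queue() for _ in range(world_size)]
        self.response_queues = [queue.Queue() for _ in range(world_size)]
        self.threads = [
            threading.Thread(
                target=worker_loop,
                args=(rank, self.rpc_queues[rank], self.response_queues[rank]),
                daemon=True,
            )
            for rank in range(world_size)
        ]
        for thread in self.threads:
            thread.start()

    def collective_rpc(self, method, args=(), kwargs=None, output_rank=None):
        call = (method, args, kwargs or {}, output_rank)
        for rpc_q in self.rpc_queues:  # emulate a broadcast
            rpc_q.put(call)
        if output_rank is None:
            ranks = range(len(self.response_queues))
        else:
            ranks = [output_rank]
        results = []
        for rank in ranks:
            status, result = self.response_queues[rank].get(timeout=5)
            if status != "SUCCESS":
                raise RuntimeError(f"worker {rank} failed: {result}")
            results.append(result)
        return results

    def shutdown(self):
        for rpc_q in self.rpc_queues:
            rpc_q.put((None, (), {}, None))
        for thread in self.threads:
            thread.join(timeout=5)


executor = ToyMultiprocExecutor(world_size=3)
all_results = executor.collective_rpc("scale_rank", args=(10,))
one_result = executor.collective_rpc("scale_rank", args=(10,), output_rank=2)
executor.shutdown()
```

Carrying output_rank inside the broadcast tuple lets workers decide locally whether to reply, so the executor only waits on the queues it actually needs.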

RayDistributedExecutor

vllm/v1/executor/ray_executor.py:64-643

RayDistributedExecutor distributes workers as Ray actors and uses Ray's Compiled DAG for the execution path.

Worker Creation

_init_workers_ray() (vllm/v1/executor/ray_executor.py:151-398) performs the following steps:

Resolves bundle_indices from the Ray placement group.
Creates RayWorkerWrapper remote actors (vllm/v1/executor/ray_executor.py:215-225).
Adjusts ranks via collective_rpc("adjust_rank", ...) (vllm/v1/executor/ray_executor.py:255-257).
Sets the CUDA_VISIBLE_DEVICES environment variable for the workers (vllm/v1/executor/ray_executor.py:270-276).
Calls collective_rpc() with "init_worker", "init_device", and "load_model" on all actors (vllm/v1/executor/ray_executor.py:304-320).

Execution via Compiled DAG

On the first execute_model() call, _compiled_ray_dag() builds a Ray CompiledDAG (vllm/v1/executor/ray_executor.py:542-635). The DAG chains the pipeline-parallel (PP) stages so that intermediate tensors flow from PP rank 0 to PP rank N−1.
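
The two ideas in that paragraph, building the graph lazily on first use and chaining PP stages in rank order, can be sketched in plain Python. Nothing below uses the Ray API: make_stage, chain_stages, and ToyPipelineExecutor are hypothetical, and lists of strings stand in for intermediate tensors.

```python
def make_stage(pp_rank):
    """Return a stage function that appends its rank to the running trace."""
    def stage(intermediate):
        return intermediate + [f"pp{pp_rank}"]
    return stage


def chain_stages(stages):
    """Compose stages so the output of rank i feeds rank i + 1."""
    def run(model_input):
        value = model_input
        for stage in stages:
            value = stage(value)
        return value
    return run


class ToyPipelineExecutor:
    def __init__(self, pp_size):
        self.pp_size = pp_size
        self.forward_dag = None  # built lazily, like a compiled graph
        self.build_count = 0

    def execute_model(self, model_input):
        if self.forward_dag is None:
            # First call: build the chain once and cache it.
            stages = [make_stage(rank) for rank in range(self.pp_size)]
            self.forward_dag = chain_stages(stages)
            self.build_count += 1
        return self.forward_dag(model_input)


executor = ToyPipelineExecutor(pp_size=3)
first = executor.execute_model(["input"])
second = executor.execute_model(["input"])
```

Caching the chain means the construction cost is paid once, and every later step only pays for execution.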

The Worker Abstraction

`WorkerBase`

vllm/v1/worker/worker_base.py:38-177

WorkerBase defines the interface every worker implementation must fulfill. It stores the decomposed VllmConfig fields and rank information.

`WorkerWrapperBase`

vllm/v1/worker/worker_base.py:179-372

WorkerWrapperBase sits between an executor and a WorkerBase instance. Its roles are:

Lazy initialization: creates the WorkerBase subclass only when init_worker() is called (vllm/v1/worker/worker_base.py:251-314).
Worker class resolution: reads parallel_config.worker_cls and instantiates it via resolve_obj_by_qualname (vllm/v1/worker/worker_base.py:270-272).
Method dispatch: execute_method(method, *args, **kwargs) calls run_method(self, ...) (vllm/v1/worker/worker_base.py:343-345).
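
The three roles above can be sketched together in a few lines. This is a hypothetical illustration: resolve_by_qualname is a simplified stand-in for resolve_obj_by_qualname, ToyWorkerWrapper is not the real class, forwarding unknown attributes to the wrapped worker is an assumption of the sketch, and collections.Counter stands in for a worker class only because it is importable by qualified name.

```python
import importlib


def resolve_by_qualname(qualname):
    """Import 'package.module.Name' and return the object called Name."""
    module_name, _, attr = qualname.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class ToyWorkerWrapper:
    def __init__(self, rpc_rank):
        self.rpc_rank = rpc_rank
        self.worker = None  # nothing is constructed yet (lazy initialization)

    def init_worker(self, worker_cls, **kwargs):
        """Resolve the class from its qualified name, then instantiate it."""
        self.worker = resolve_by_qualname(worker_cls)(**kwargs)

    def execute_method(self, method, *args, **kwargs):
        """Dispatch by name; unknown attributes fall through to the worker."""
        return getattr(self, method)(*args, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup on the wrapper fails.
        return getattr(self.worker, name)


wrapper = ToyWorkerWrapper(rpc_rank=0)
created_before_init = wrapper.worker is not None
wrapper.execute_method("init_worker", "collections.Counter", a=2, b=1)
top = wrapper.execute_method("most_common", 1)
```

Deferring construction until init_worker() lets the executor create the wrapper first and choose the concrete worker class later.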

GPU Worker (`Worker`)

vllm/v1/worker/gpu_worker.py:117-900

Worker is the concrete WorkerBase implementation for GPU execution. It coordinates between the distributed environment and the GPUModelRunner.

Class Relationships

Code entities in the GPU worker layer

Sources: vllm/v1/worker/gpu_worker.py:117-164, vllm/v1/worker/worker_base.py:39-83, vllm/v1/worker/worker_base.py:187-215

Memory Profiling and Device Management

The Worker manages device setup, memory profiling, and sleep/wake transitions:

init_device(): Configures torch.set_float32_matmul_precision (vllm/v1/worker/gpu_worker.py:135-137), initializes distributed groups via init_distributed_environment() (vllm/v1/worker/gpu_worker.py:245-247), and creates the GPUModelRunner (vllm/v1/worker/gpu_worker.py:289-291).
determine_available_memory(): Uses memory_profiling and model_runner.profile_run() to find the bytes remaining for the KV cache (vllm/v1/worker/gpu_worker.py:353-502).
sleep() / wake_up(): Interface with CuMemAllocator (via get_mem_allocator_instance) to offload model weights or KV cache to CPU memory (vllm/v1/worker/gpu_worker.py:165-190).
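
To give a feel for the arithmetic behind "bytes remaining for the KV cache", here is a simplified sketch. It assumes the budget is a utilization fraction of total device memory minus everything the profiling run measured as non-KV usage (weights, activations, other allocations). That formula, the kv_cache_budget_bytes helper, and the numbers are assumptions for illustration, not the logic of determine_available_memory().

```python
GiB = 1 << 30


def kv_cache_budget_bytes(total_bytes, gpu_memory_utilization, non_kv_cache_bytes):
    """Assumed model: budget = total * utilization - non-KV usage, floored at 0."""
    requested = int(total_bytes * gpu_memory_utilization)
    return max(requested - non_kv_cache_bytes, 0)


# Hypothetical device: 80 GiB total, half of it allowed for this instance,
# and 30 GiB measured as non-KV usage during the profiling run.
budget = kv_cache_budget_bytes(
    total_bytes=80 * GiB,
    gpu_memory_utilization=0.5,
    non_kv_cache_bytes=30 * GiB,
)
```

In this example the instance may use 40 GiB, of which 30 GiB is already accounted for, leaving 10 GiB for KV cache blocks.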

Microbatching and SM Control

The GPU worker supports microbatching (ubatching) and SM (Streaming Multiprocessor) control, both of which are used to overlap communication with computation.

UBatchWrapper: Wraps a runnable (like the model forward pass) to handle execution across multiple ubatches.
SMControlContextManager: Controls SM allocation between communication (e.g., DeepEP for MoE) and computation (e.g., DeepGEMM). It sets the number of SMs for communication and computation upon entering the context.
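
The enter-time behavior of such a context manager can be sketched as follows. ToySMControl and its constructor arguments are hypothetical; splitting SMs as total minus communication, and restoring the full SM count on exit, are assumptions of this sketch rather than documented behavior of SMControlContextManager.

```python
class ToySMControl:
    """Hypothetical context manager that partitions SMs on entry."""

    def __init__(self, comm_sms, total_sms, set_comm_sms, set_compute_sms):
        if not 0 <= comm_sms <= total_sms:
            raise ValueError("comm_sms must be between 0 and total_sms")
        self.comm_sms = comm_sms
        self.total_sms = total_sms
        self.set_comm_sms = set_comm_sms        # e.g. a comm-library setter
        self.set_compute_sms = set_compute_sms  # e.g. a GEMM-library setter

    def __enter__(self):
        # Apply the split as soon as the context is entered.
        self.set_comm_sms(self.comm_sms)
        self.set_compute_sms(self.total_sms - self.comm_sms)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Assumed behavior: give both sides the full device back.
        self.set_comm_sms(self.total_sms)
        self.set_compute_sms(self.total_sms)
        return False  # never swallow exceptions


log = []
with ToySMControl(
    comm_sms=20,
    total_sms=132,
    set_comm_sms=lambda n: log.append(("comm", n)),
    set_compute_sms=lambda n: log.append(("compute", n)),
):
    log.append(("run", None))  # overlapped work would happen here
```

Injecting the two setters keeps the sketch independent of any particular communication or GEMM library.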

Profiling Integration

The Worker manages performance profiling through self.profiler (vllm/v1/worker/gpu_worker.py:151-152). It supports both TorchProfilerWrapper and CudaProfilerWrapper, selected by profiler_config.profiler (vllm/v1/worker/gpu_worker.py:157-159). Profiling is started and stopped via start_profile() and stop_profile(), which wrap the underlying profiler's lifecycle (vllm/v1/worker/gpu_worker.py:610-637).

Sources: vllm/v1/worker/gpu_worker.py:146-152, vllm/v1/worker/gpu_worker.py:610-637
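
The selection-plus-lifecycle pattern described in this section might look roughly like the sketch below. All names are hypothetical stand-ins: the "torch" and "cuda" keys are assumed values for profiler_config.profiler, the toy profiler classes only flip a flag, and raising when no profiler is configured is a choice made for this sketch.

```python
class ToyTorchProfiler:
    """Hypothetical stand-in for a torch-based profiler wrapper."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class ToyCudaProfiler(ToyTorchProfiler):
    """Hypothetical stand-in for a CUDA-based profiler wrapper."""


# Assumed configuration values; the real accepted strings may differ.
PROFILERS = {"torch": ToyTorchProfiler, "cuda": ToyCudaProfiler}


class ToyProfiledWorker:
    def __init__(self, profiler_name=None):
        # No profiler object is created unless one is configured.
        self.profiler = PROFILERS[profiler_name]() if profiler_name else None

    def _require_profiler(self):
        if self.profiler is None:
            raise RuntimeError("Profiling is not enabled.")
        return self.profiler

    def start_profile(self):
        self._require_profiler().start()

    def stop_profile(self):
        self._require_profiler().stop()


worker = ToyProfiledWorker("cuda")
worker.start_profile()
running_after_start = worker.profiler.running
worker.stop_profile()
running_after_stop = worker.profiler.running
```

Keeping both wrappers behind the same start/stop pair means callers never need to know which profiler was configured.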