This document describes the sampling and token generation subsystem in vLLM, focusing on the V1 engine architecture. It covers how logits produced by the model forward pass are transformed into discrete token IDs through various sampling strategies, penalties, and structured constraints. This includes the configuration via SamplingParams, the role of logprobs processing, and the execution flow in the Sampler, including specialized mechanisms like TopKTopPSampler and RejectionSampler.
The sampling subsystem consists of several key components that coordinate to transform model logits into token IDs. In the V1 engine, the Sampler class handles the core logic, while specialized modules handle penalties, logprobs, and speculative decoding.
The following diagram maps logical sampling stages to the specific code entities responsible for them in the V1 engine.
Sources: vllm/v1/worker/gpu/sample/sampler.py:30-76, vllm/v1/sample/ops/topk_topp_sampler.py:70-95, vllm/v1/worker/gpu/input_batch.py:36-96, vllm/v1/sample/rejection_sampler.py:37-58
Each request specifies its sampling behavior through the SamplingParams class. This object defines everything from basic randomness to complex penalties and structured output requirements.
SamplingParams fields include:

- temperature, top_p, top_k, min_p, and seed (vllm/sampling_params.py:204-222)
- presence_penalty, frequency_penalty, and repetition_penalty (vllm/sampling_params.py:193-203)
- stop, stop_token_ids, min_tokens, and bad_words (vllm/sampling_params.py:233-248)
- StructuredOutputsParams, supporting JSON, Regex, and grammars (vllm/sampling_params.py:70-83)

The SamplingStates and RequestState in the V1 worker manage the GPU-side state for these parameters. RequestState specifically maintains UVA-backed tensors for all_token_ids and total_len to support penalties and logprobs without excessive GPU memory consumption (vllm/v1/worker/gpu/states.py:33-53).
Sources: vllm/sampling_params.py:198-250, vllm/v1/worker/gpu/states.py:9-82
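To make the three penalty parameters concrete, the following is a minimal pure-Python sketch of their conventional definitions, applied to a single request's logits. The helper name apply_penalties and its list-based signature are hypothetical and are not vLLM's API; the actual V1 implementation operates on batched GPU tensors and may differ in detail.

```python
from collections import Counter


def apply_penalties(logits, prompt_ids, output_ids,
                    presence_penalty=0.0, frequency_penalty=0.0,
                    repetition_penalty=1.0):
    """Return penalized logits for one request (hypothetical helper)."""
    out = [float(x) for x in logits]
    output_counts = Counter(output_ids)

    # Repetition penalty: applies to every token seen in the prompt or the
    # output. Positive logits are divided, non-positive ones multiplied,
    # so a penalty > 1 always makes a seen token less likely.
    for tok in set(prompt_ids) | set(output_ids):
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty

    # Frequency and presence penalties: apply to generated tokens only.
    for tok, count in output_counts.items():
        out[tok] -= frequency_penalty * count  # scales with occurrences
        out[tok] -= presence_penalty           # flat, once per seen token
    return out


penalized = apply_penalties(
    logits=[2.0, -1.0, 0.5, 3.0],
    prompt_ids=[0],
    output_ids=[1, 1, 3],
    presence_penalty=0.5,
    frequency_penalty=0.25,
    repetition_penalty=2.0,
)
print(penalized)
```

The sketch shows why the worker needs the full token history (all_token_ids) per request: every penalty depends on which tokens have already appeared and, for the frequency penalty, how often.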
The Sampler class implements the primary token selection logic. It is invoked by the GPUModelRunner after the model forward pass.
- add_request, which populates the PenaltiesState, LogitBiasState, and BadWordsState (vllm/v1/worker/gpu/sample/sampler.py:56-63)
- apply_staged_writes, which runs before the forward pass (vllm/v1/worker/gpu/sample/sampler.py:65-70)
- apply_sampling_params, which converts logits to float32 and applies logit biases, penalties, and bad words masking in-place (vllm/v1/worker/gpu/sample/sampler.py:156-177)
- The sample method, which applies temperature scaling and filtering (top-k/top-p) via optimized kernels (vllm/v1/worker/gpu/sample/sampler.py:179-206)
- compute_topk_logprobs, which extracts the top-k logprobs and specific token logprobs from the processed logits (vllm/v1/worker/gpu/sample/sampler.py:110-118)

The TopKTopPSampler selects a backend kernel based on the hardware and configuration:
| Backend | Condition | Source |
|---|---|---|
| FlashInfer | Enabled via VLLM_USE_FLASHINFER_SAMPLER on compatible CUDA devices | vllm/v1/sample/ops/topk_topp_sampler.py:38-56 |
| aiter | Optimized sampler for ROCm platforms using rocm_aiter_ops | vllm/v1/sample/ops/topk_topp_sampler.py:112-119 |
| CPU | Optimized implementation for x86/ARM, falling back to native for RISC-V/PowerPC | vllm/v1/sample/ops/topk_topp_sampler.py:96-104 |
| XPU | Intel XPU optimized kernel via VLLM_XPU_USE_SAMPLER_KERNEL | vllm/v1/sample/ops/topk_topp_sampler.py:106-109 |
| Native PyTorch | Fallback for other cases or when specific intermediate logprobs are required | vllm/v1/sample/ops/topk_topp_sampler.py:123-145 |
Sources: vllm/v1/worker/gpu/sample/sampler.py:72-144, vllm/v1/sample/ops/topk_topp_sampler.py:70-130
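Whichever backend is chosen, the operation being accelerated is the same. The sketch below illustrates temperature scaling followed by top-k and top-p (nucleus) filtering for a single request, using only the Python standard library. The function names filter_logits and sample are illustrative and do not correspond to vLLM functions, and the sketch assumes top-p is computed over the distribution that remains after top-k; the real kernels are batched and may order or fuse these steps differently.

```python
import math
import random


def filter_logits(logits, top_k=0, top_p=1.0):
    """Mask logits outside the top-k / top-p set with -inf (illustrative)."""
    # Token indices ordered from most to least likely.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order

    # Softmax over the top-k survivors, then keep the smallest prefix
    # whose cumulative probability reaches top_p.
    m = logits[keep[0]]
    weights = [math.exp(logits[i] - m) for i in keep]
    total = sum(weights)
    allowed, cumulative = set(), 0.0
    for i, w in zip(keep, weights):
        allowed.add(i)
        cumulative += w / total
        if cumulative >= top_p:
            break
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token ID; temperature 0 is treated as greedy decoding."""
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]
    filtered = filter_logits(scaled, top_k, top_p)
    m = max(filtered)
    # exp(-inf) is 0.0, so masked tokens can never be drawn.
    weights = [math.exp(x - m) for x in filtered]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0, 0.0]
print(filter_logits(logits, top_k=2))
print(sample(logits, temperature=0.8, top_k=2, rng=random.Random(0)))
```

Passing a seeded random.Random plays the role of the per-request seed in SamplingParams: the same seed and logits yield the same token.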
The sampling system handles both full-vocab top-k logprobs and specific token ID logprobs via LogprobTokenIdsState.
Sources: vllm/v1/worker/gpu/sample/sampler.py:104-118, vllm/v1/worker/gpu/sample/logprob.py:20-23, vllm/v1/outputs.py:53
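The two logprob modes can be illustrated with a small standard-library sketch: a numerically stable log-softmax over the full vocabulary, followed by a top-k selection and a lookup of explicitly requested token IDs. The function gather_logprobs and the shape of its return value are hypothetical and chosen for readability; they do not mirror the tensors vLLM actually returns.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def gather_logprobs(logits, sampled_id, num_logprobs=0, token_ids=()):
    """Collect logprobs for the sampled token, the top-k tokens, and any
    explicitly requested token IDs (hypothetical helper)."""
    logprobs = log_softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: logprobs[i], reverse=True)
    # Rank 1 means the sampled token was the most likely one.
    rank = 1 + sum(1 for lp in logprobs if lp > logprobs[sampled_id])
    return {
        "sampled": logprobs[sampled_id],
        "rank": rank,
        "top_k": {i: logprobs[i] for i in order[:num_logprobs]},
        "requested": {i: logprobs[i] for i in token_ids},
    }


# Logits chosen so the probabilities are 1/8, 2/8, and 5/8.
logits = [math.log(1.0), math.log(2.0), math.log(5.0)]
result = gather_logprobs(logits, sampled_id=1, num_logprobs=2, token_ids=[0])
print(result)
```

The requested-ID path avoids sorting the vocabulary for tokens the caller already knows it wants, which is the distinction the text draws between the two modes.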
When speculative decoding is enabled, the RejectionSampler validates draft tokens against the target model's logits using the algorithm from arXiv:2211.17192.
- Sampler (vllm/v1/sample/rejection_sampler.py:129-142)
- SpeculativeConfig (vllm/v1/sample/rejection_sampler.py:73-86)

Sources: vllm/v1/sample/rejection_sampler.py:37-165
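The acceptance rule from arXiv:2211.17192 can be sketched for a single request in plain Python. Each draft token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. On the first rejection a replacement is drawn from the normalized residual max(0, p - q) and verification stops. If every draft token is accepted, one bonus token is drawn from the target distribution at the next position. The function name and the list-of-lists inputs are assumptions for illustration; vLLM's RejectionSampler works on batched GPU tensors.

```python
import random


def rejection_sample(draft_tokens, draft_probs, target_probs, rng):
    """Verify draft tokens against the target model (illustrative sketch).

    draft_probs[i] and target_probs[i] are full-vocabulary distributions
    for position i; target_probs carries one extra row for the bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual distribution and stop.
        residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
        accepted.append(rng.choices(range(len(p)), weights=residual, k=1)[0])
        return accepted
    # All drafts accepted: draw a bonus token from the target model.
    bonus = target_probs[len(draft_tokens)]
    accepted.append(rng.choices(range(len(bonus)), weights=bonus, k=1)[0])
    return accepted


rng = random.Random(0)
# Draft and target agree, so both draft tokens pass and a bonus is added.
agree = rejection_sample(
    [0, 1],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]],
    rng,
)
# The target gives the first draft token zero probability, so it is replaced.
disagree = rejection_sample(
    [0, 0],
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.5, 0.5], [0.5, 0.5]],
    rng,
)
print(agree, disagree)
```

Under this rule the emitted tokens follow the target distribution exactly, which is why speculative decoding can speed up generation without changing output quality.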
Sampling in the V1 engine is integrated with CUDA Graphs via ModelCudaGraphManager.
(num_tokens, num_reqs) and CUDAGraphMode (vllm/v1/worker/gpu/cudagraph_utils.py:52-61)