Sampling and Token Generation

Purpose and Scope

This document describes the sampling and token generation subsystem in vLLM, focusing on the V1 engine architecture. It covers how logits produced by the model forward pass are transformed into discrete token IDs through sampling strategies, penalties, and structured constraints. It also covers configuration through SamplingParams, logprobs processing, and the execution flow in the Sampler, including the specialized TopKTopPSampler and RejectionSampler.

Sampling Components Overview

The sampling subsystem consists of several key components that coordinate to transform model logits into token IDs. In the V1 engine, the Sampler class handles the core logic, while specialized modules handle penalties, logprobs, and speculative decoding.

Natural Language to Code Entity Mapping: Sampling Pipeline

[Diagram: maps logical sampling stages to the specific code entities responsible for them in the V1 engine.]

Sources: vllm/v1/worker/gpu/sample/sampler.py30-76 vllm/v1/sample/ops/topk_topp_sampler.py70-95 vllm/v1/worker/gpu/input_batch.py36-96 vllm/v1/sample/rejection_sampler.py37-58

SamplingParams and State Management

Each request specifies its sampling behavior through the SamplingParams class. This object defines everything from basic randomness to complex penalties and structured output requirements.

Key Parameters in `SamplingParams`

Randomness: temperature, top_p, top_k, min_p, and seed vllm/sampling_params.py204-222
Penalties: presence_penalty, frequency_penalty, and repetition_penalty vllm/sampling_params.py193-203
Constraints: stop, stop_token_ids, min_tokens, and bad_words vllm/sampling_params.py233-248
Structured Outputs: Encapsulated in StructuredOutputsParams, supporting JSON, Regex, and grammars vllm/sampling_params.py70-83

In the V1 worker, SamplingStates and RequestState manage the GPU-side state for these parameters. RequestState maintains UVA-backed (Unified Virtual Addressing) tensors for all_token_ids and total_len, so that penalties and logprobs can read each request's token history without excessive GPU memory consumption vllm/v1/worker/gpu/states.py33-53
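To make the penalty parameters concrete, the following is a minimal pure-Python sketch of how the three penalties are commonly defined. It is not vLLM's GPU implementation: the function name `apply_penalties` is hypothetical, and the sketch assumes that the repetition penalty applies to tokens seen in the prompt or the output, while the frequency and presence penalties apply only to generated tokens.

```python
from collections import Counter


def apply_penalties(logits, prompt_ids, output_ids,
                    presence_penalty=0.0, frequency_penalty=0.0,
                    repetition_penalty=1.0):
    """Return a new list of logits with the three penalties applied.

    Hypothetical helper for illustration; token IDs index into `logits`.
    """
    out = list(logits)
    output_counts = Counter(output_ids)
    seen = set(prompt_ids) | set(output_ids)

    # Repetition penalty (assumed to cover prompt and output tokens):
    # shrink positive logits, push negative logits further down.
    for tok in seen:
        if out[tok] > 0:
            out[tok] /= repetition_penalty
        else:
            out[tok] *= repetition_penalty

    # Frequency penalty scales with how often a token was generated;
    # presence penalty is a flat cost for any token generated at least once.
    for tok, count in output_counts.items():
        out[tok] -= frequency_penalty * count
        out[tok] -= presence_penalty

    return out


logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_penalties(
    logits,
    prompt_ids=[0],
    output_ids=[3, 3, 1],
    presence_penalty=0.5,
    frequency_penalty=0.25,
    repetition_penalty=2.0,
)
print(penalized)
```

Token 2 never appears in the history, so its logit is left unchanged, while token 3 (generated twice) receives the largest total reduction. This is why the per-request token history has to be available wherever the penalties are computed.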

Sources: vllm/sampling_params.py198-250 vllm/v1/worker/gpu/states.py9-82

Sampler Implementation and Execution Flow

The Sampler class implements the primary token selection logic. It is invoked by the GPUModelRunner after the model forward pass.

Execution Steps

Request Handling: Requests are added to the sampler via add_request, which populates the PenaltiesState, LogitBiasState, and BadWordsState vllm/v1/worker/gpu/sample/sampler.py56-63
Staged Writes: Parameters are committed to GPU/UVA memory via apply_staged_writes before the forward pass vllm/v1/worker/gpu/sample/sampler.py65-70
Logits Processing: apply_sampling_params converts logits to float32 and applies logit biases, penalties, and bad words masking in-place vllm/v1/worker/gpu/sample/sampler.py156-177
Token Selection: The sample method applies temperature scaling and filtering (top-k/top-p) via optimized kernels vllm/v1/worker/gpu/sample/sampler.py179-206
Logprobs Computation: If requested, compute_topk_logprobs extracts the top-k logprobs and specific token logprobs from the processed logits vllm/v1/worker/gpu/sample/sampler.py110-118
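The Token Selection step can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not vLLM's kernel code: the names `softmax`, `filter_top_k_top_p`, and `sample_token` are hypothetical, a `top_k` of 0 or less is taken to mean "disabled", and a temperature of 0 is treated as greedy decoding.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; -inf entries get probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(logits, top_k=0, top_p=1.0):
    """Mask every logit outside the top-k set and the top-p nucleus."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = order[:top_k] if top_k > 0 else order

    # Walk the candidates from most to least likely until the
    # cumulative probability reaches top_p.
    probs = softmax([logits[i] for i in keep])
    nucleus, cumulative = set(), 0.0
    for idx, p in zip(keep, probs):
        nucleus.add(idx)
        cumulative += p
        if cumulative >= top_p:
            break

    return [x if i in nucleus else -math.inf for i, x in enumerate(logits)]


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Temperature scaling, then top-k/top-p filtering, then a random draw."""
    if temperature == 0.0:
        # Assumption of this sketch: zero temperature means argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    probs = softmax(filter_top_k_top_p(scaled, top_k, top_p))
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_logits = [1.0, 5.0, 2.0]
greedy = sample_token(example_logits, temperature=0.0)
top1 = sample_token(example_logits, temperature=0.7, top_k=1,
                    rng=random.Random(0))
print(greedy, top1)
```

Passing a seeded `random.Random` instance plays the role of a per-request seed in this sketch: the same seed and logits give the same draw.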

Backend Selection (TopKTopPSampler)

The TopKTopPSampler selects a sampling backend based on the hardware platform and, for some backends, on environment variables:

Backend	Condition
FlashInfer	Enabled via `VLLM_USE_FLASHINFER_SAMPLER` on compatible CUDA devices vllm/v1/sample/ops/topk_topp_sampler.py38-56
aiter	Optimized sampler for ROCm platforms using `rocm_aiter_ops` vllm/v1/sample/ops/topk_topp_sampler.py112-119
CPU	Optimized implementation for x86/ARM, falling back to the native PyTorch path on RISC-V/PowerPC vllm/v1/sample/ops/topk_topp_sampler.py96-104
XPU	Intel XPU optimized kernel via `VLLM_XPU_USE_SAMPLER_KERNEL` vllm/v1/sample/ops/topk_topp_sampler.py106-109
Native PyTorch	Fallback for other cases or when specific intermediate logprobs are required vllm/v1/sample/ops/topk_topp_sampler.py123-145

Sources: vllm/v1/worker/gpu/sample/sampler.py72-144 vllm/v1/sample/ops/topk_topp_sampler.py70-130

Logprobs and Data Flow

The sampling system handles both full-vocab top-k logprobs and specific token ID logprobs via LogprobTokenIdsState.
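As a rough illustration of what a top-k logprobs computation produces, the sketch below computes log-probabilities with a stable log-softmax and gathers the entries for the sampled token, the top-k tokens, and any explicitly requested token IDs. The function names, the argument `extra_token_ids`, and the returned layout are assumptions made for this example and are not taken from vLLM's code.

```python
import math


def log_softmax(logits):
    """Stable log-softmax: x - logsumexp(x)."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def topk_logprobs(logits, sampled_id, num_logprobs, extra_token_ids=()):
    """Return (token_ids, logprobs, rank_of_sampled_token).

    Hypothetical layout: the sampled token comes first, then the top-k
    tokens, then any explicitly requested token IDs.
    """
    logprobs = log_softmax(logits)
    order = sorted(range(len(logprobs)),
                   key=lambda i: logprobs[i], reverse=True)
    token_ids = [sampled_id] + order[:num_logprobs] + list(extra_token_ids)

    # Rank 1 means the sampled token was the most likely one.
    rank = 1 + sum(1 for lp in logprobs if lp > logprobs[sampled_id])
    return token_ids, [logprobs[i] for i in token_ids], rank


ids, lps, rank = topk_logprobs([0.0, 2.0, 1.0], sampled_id=2,
                               num_logprobs=2, extra_token_ids=[0])
print(ids, rank)
```

Working in log space avoids underflow for very unlikely tokens, which matters when the vocabulary is large.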

[Diagram: Logprobs Data Flow]

Sources: vllm/v1/worker/gpu/sample/sampler.py104-118 vllm/v1/worker/gpu/sample/logprob.py20-23 vllm/v1/outputs.py53

Speculative Rejection Sampling

When speculative decoding is enabled, the RejectionSampler validates draft tokens against the target model's logits using the algorithm from arXiv:2211.17192.

Forward Flow

Bonus Tokens: It first samples "bonus tokens" from the target model's logits using the standard Sampler vllm/v1/sample/rejection_sampler.py129-142
Target Logits Processing: It applies logits processors and sampling constraints to target logits before validation vllm/v1/sample/rejection_sampler.py157-165
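Acceptance Logic: Each draft token is accepted or rejected by comparing its probability under the target model with its probability under the draft model; in the cited paper, a token is accepted with probability min(1, p_target / p_draft). If all proposed tokens are accepted, a bonus token is appended vllm/v1/sample/rejection_sampler.py42-57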
Synthetic Mode: Supports "synthetic" rejection sampling using unconditional-to-conditional rate conversion based on SpeculativeConfig vllm/v1/sample/rejection_sampler.py73-86
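The acceptance rule from arXiv:2211.17192 can be sketched for a single request as follows. This is an illustrative, unbatched version, not the RejectionSampler code: the function name `rejection_sample` and its arguments are hypothetical, and it assumes every draft token has nonzero probability under the draft distribution (which holds when the token was sampled from it).

```python
import random


def rejection_sample(draft_tokens, draft_probs, target_probs,
                     bonus_token, rng):
    """Validate draft tokens left to right.

    draft_probs[i] and target_probs[i] are full-vocabulary probability
    lists for position i. Returns the accepted tokens, followed either by
    a recovered token (on first rejection) or by the bonus token.
    """
    output = []
    for pos, tok in enumerate(draft_tokens):
        q = draft_probs[pos][tok]   # draft probability, assumed > 0
        p = target_probs[pos][tok]  # target probability

        # Accept with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            output.append(tok)
            continue

        # On rejection, resample from the normalized residual max(0, p - q)
        # and stop: later draft tokens were conditioned on a rejected prefix.
        residual = [max(0.0, pt - qd)
                    for pt, qd in zip(target_probs[pos], draft_probs[pos])]
        output.append(rng.choices(range(len(residual)),
                                  weights=residual, k=1)[0])
        return output

    # Every draft token was accepted, so the bonus token is appended.
    output.append(bonus_token)
    return output


rng = random.Random(0)
matching = [[0.5, 0.5], [0.9, 0.1]]
all_accepted = rejection_sample([1, 0], matching, matching,
                                bonus_token=7, rng=rng)
rejected = rejection_sample([0], [[1.0, 0.0]], [[0.0, 1.0]],
                            bonus_token=7, rng=rng)
print(all_accepted, rejected)
```

When draft and target distributions agree, the ratio is 1 and every token is accepted. When the target assigns zero probability to a draft token, it is always rejected and replaced by a token drawn from the residual distribution.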

Sources: vllm/v1/sample/rejection_sampler.py37-165

CUDA Graph Management

Sampling in the V1 engine is integrated with CUDA Graphs via ModelCudaGraphManager.

Capture: Graphs are captured for specific batch shapes (num_tokens, num_reqs) and CUDAGraphMode vllm/v1/worker/gpu/cudagraph_utils.py52-61
Dispatch: During execution, the manager selects the smallest captured graph that can accommodate the current batch size vllm/v1/worker/gpu/cudagraph_utils.py75-92
Uniformity: It optimizes for uniform batches where all requests have the same token count vllm/v1/worker/gpu/cudagraph_utils.py95-108
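The dispatch rule described above (pick the smallest captured graph that fits) can be sketched with a binary search over the captured sizes. The function name `select_graph_size` and the use of a single integer size are simplifications for illustration; the real manager also keys on the other shape and mode information listed under Capture.

```python
import bisect


def select_graph_size(captured_sizes, batch_size):
    """Return the smallest captured size >= batch_size, or None.

    `captured_sizes` must be sorted in ascending order. None signals
    that no captured graph fits and another execution path is needed
    (an assumption of this sketch).
    """
    idx = bisect.bisect_left(captured_sizes, batch_size)
    if idx == len(captured_sizes):
        return None
    return captured_sizes[idx]


captured = [1, 2, 4, 8, 16]
print(select_graph_size(captured, 3))
print(select_graph_size(captured, 8))
print(select_graph_size(captured, 17))
```

A batch of 3 would run in the graph captured for size 4, with the unused slots padded, while a batch larger than every captured size cannot be served from a captured graph.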

Sources: vllm/v1/worker/gpu/cudagraph_utils.py52-111