This page documents how vLLM collects, aggregates, and exposes runtime metrics. It covers the v1 stats dataclasses, the logger hierarchy, Prometheus metrics, request latency breakdowns, and the integration with Ray and custom plugin loggers.
The metrics system consists of a stats collection layer that assembles data after each engine step and a logger layer that consumes those stats to write to standard output or expose them via a Prometheus-compatible endpoint.
Metrics are collected in the engine core during execution and processed by the frontend. In AsyncLLM, the _run_output_handler() loop receives EngineCoreOutputs, which the OutputProcessor uses to update IterationStats. These stats are then passed to a StatLoggerManager instance.
Data flow through the metrics system
Sources: vllm/v1/metrics/loggers.py44-64 vllm/v1/metrics/stats.py171-200 vllm/v1/metrics/stats.py303-330 vllm/v1/metrics/loggers.py1022-1050 vllm/v1/engine/output_processor.py138-143
StatLoggerManager coordinates all registered stat loggers. It is instantiated during engine initialization in LLMEngine or AsyncLLM when log_stats=True. It manages a list of StatLoggerBase implementations and handles the loading of plugins via load_stat_logger_plugin_factories.
Initialization flow
Sources: vllm/v1/metrics/loggers.py74-88 vllm/v1/metrics/loggers.py1014-1025 vllm/v1/engine/output_processor.py176-177
Stats objects are defined in vllm/v1/metrics/stats.py as dataclasses to facilitate easy transfer between the engine core and the frontend.
SchedulerStats is produced by the scheduler after each step and carries the current state of resource allocation and cache usage.
| Field | Type | Description |
|---|---|---|
| num_running_reqs | int | Requests currently in model execution batches. |
| num_waiting_reqs | int | Requests in the waiting queue. |
| kv_cache_usage | float | Fraction of KV cache blocks in use (0.0–1.0). |
| prefix_cache_stats | PrefixCacheStats | Local prefix cache hit/miss counts. |
| spec_decoding_stats | SpecDecodingStats | Statistics for speculative decoding performance. |
| perf_stats | PerfStats | MFU and hardware performance stats. |
| kv_connector_stats | dict[str, Any] | Stats for disaggregated KV transfer. |
| cudagraph_stats | CUDAGraphStat | CUDA Graph capture and usage statistics. |
| waiting_lora_adapters | dict[str, int] | Count of LoRA adapters in waiting requests. |
| running_lora_adapters | dict[str, int] | Count of LoRA adapters in running requests. |
Sources: vllm/v1/metrics/stats.py171-200
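As a rough illustration of why plain dataclasses suit this role, the sketch below mirrors a few of the fields from the table in a hypothetical `ToySchedulerStats` class. It is not the real definition: the class name, the defaults, and the choice of fields are assumptions made for the example.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ToySchedulerStats:
    """Hypothetical, trimmed-down stand-in for the scheduler stats object."""

    num_running_reqs: int = 0
    num_waiting_reqs: int = 0
    kv_cache_usage: float = 0.0  # fraction in [0.0, 1.0]
    # Mutable defaults need default_factory so instances do not share state.
    waiting_lora_adapters: dict[str, int] = field(default_factory=dict)
    running_lora_adapters: dict[str, int] = field(default_factory=dict)


stats = ToySchedulerStats(num_running_reqs=4, num_waiting_reqs=2, kv_cache_usage=0.25)
stats.running_lora_adapters["adapter-a"] = 3  # illustrative adapter name

# A dataclass converts to a plain dict, which is easy to pass between processes.
payload = asdict(stats)
```

Because every field has a default, a consumer can construct an empty instance and fill in only what a given step produced.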
IterationStats is assembled by the OutputProcessor from a batch of EngineCoreOutput objects. It tracks token counts and finished request summaries.
| Field | Type | Description |
|---|---|---|
| num_generation_tokens | int | Total generation tokens in this step. |
| prompt_token_stats | PromptTokenStats | Breakdown of prompt tokens by source (compute, local cache, or external transfer). |
| finished_requests | list[FinishedRequestStats] | Summary for completed requests in this iteration. |
| num_corrupted_reqs | int | Number of requests corrupted (e.g., due to NaNs). |
| num_preempted_reqs | int | Number of requests preempted. |
Sources: vllm/v1/metrics/stats.py303-330
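The accumulation described above can be sketched as follows. This is a simplified model, not the real classes: `ToyEngineCoreOutput`, `ToyIterationStats`, and `add_output` are hypothetical names, and finished requests are recorded by ID here rather than as full summary objects.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEngineCoreOutput:
    """Hypothetical per-request output for one engine step."""

    request_id: str
    new_token_ids: list[int]
    finished: bool = False


@dataclass
class ToyIterationStats:
    """Hypothetical per-step accumulator (only two of the fields are modeled)."""

    num_generation_tokens: int = 0
    finished_requests: list[str] = field(default_factory=list)

    def add_output(self, output: ToyEngineCoreOutput) -> None:
        # Every new token in this step counts toward generation throughput.
        self.num_generation_tokens += len(output.new_token_ids)
        if output.finished:
            self.finished_requests.append(output.request_id)


def collect_iteration_stats(outputs: list[ToyEngineCoreOutput]) -> ToyIterationStats:
    """Fold one step's batch of outputs into a fresh stats object."""
    stats = ToyIterationStats()
    for output in outputs:
        stats.add_output(output)
    return stats


step_stats = collect_iteration_stats(
    [
        ToyEngineCoreOutput("req-1", [11, 12]),
        ToyEngineCoreOutput("req-2", [21], finished=True),
    ]
)
```

A fresh object per step keeps each iteration's numbers independent, so loggers can decide for themselves how to aggregate across steps.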
CachingMetrics maintains a sliding-window hit rate over the most recent $N$ requests (default 1000).
- Window contents: a deque of (requests, queries, hits) tuples. vllm/v1/metrics/stats.py51-52
- Hit rate: aggregated_query_hit / aggregated_query_total. vllm/v1/metrics/stats.py107-111
- Reset: BaseCacheStats.reset (e.g., after a manual cache clear). vllm/v1/metrics/stats.py68-69

Sources: vllm/v1/metrics/stats.py35-111
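A minimal sketch of such a sliding window is shown below. The class name `SlidingWindowHitRate` and the exact eviction rule are assumptions for illustration; only the deque of `(requests, queries, hits)` tuples and the `aggregated_query_hit / aggregated_query_total` ratio come from the description above.

```python
from collections import deque


class SlidingWindowHitRate:
    """Hypothetical hit-rate tracker over roughly the last N requests."""

    def __init__(self, max_recent_requests: int = 1000) -> None:
        self.max_recent_requests = max_recent_requests
        self.window: deque[tuple[int, int, int]] = deque()
        self.aggregated_requests = 0
        self.aggregated_query_total = 0
        self.aggregated_query_hit = 0

    def observe(self, requests: int, queries: int, hits: int) -> None:
        if requests == 0:
            return  # nothing happened in this step
        self.window.append((requests, queries, hits))
        self.aggregated_requests += requests
        self.aggregated_query_total += queries
        self.aggregated_query_hit += hits
        # Drop the oldest entries once the window covers too many requests,
        # always keeping at least the newest entry.
        while len(self.window) > 1 and self.aggregated_requests > self.max_recent_requests:
            old_requests, old_queries, old_hits = self.window.popleft()
            self.aggregated_requests -= old_requests
            self.aggregated_query_total -= old_queries
            self.aggregated_query_hit -= old_hits

    def reset(self) -> None:
        """Forget all history, e.g. after the cache itself was cleared."""
        self.window.clear()
        self.aggregated_requests = 0
        self.aggregated_query_total = 0
        self.aggregated_query_hit = 0

    @property
    def hit_rate(self) -> float:
        if self.aggregated_query_total == 0:
            return 0.0
        return self.aggregated_query_hit / self.aggregated_query_total
```

Keeping running totals alongside the deque means the hit rate is available in constant time, at the cost of a little bookkeeping on eviction.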
The system uses an abstract base class StatLoggerBase to define the interface for all loggers.
Sources: vllm/v1/metrics/loggers.py44-64 vllm/v1/metrics/loggers.py99-103 vllm/v1/metrics/loggers.py228-234
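The sketch below shows the general shape of this pattern: an abstract interface with `record()` and `log()`, one concrete logger, and a manager that fans calls out to every registered logger. All class names are hypothetical and the signatures are deliberately simplified; the real methods take more arguments than shown here.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ToyIterationStats:
    """Hypothetical stand-in for the per-step stats object."""

    num_generation_tokens: int = 0


class ToyStatLoggerBase(ABC):
    """Simplified interface: record() ingests stats, log() emits a summary."""

    @abstractmethod
    def record(self, iteration_stats: ToyIterationStats) -> None: ...

    @abstractmethod
    def log(self) -> str: ...


class TokenCountLogger(ToyStatLoggerBase):
    """Example logger that keeps a running total of generated tokens."""

    def __init__(self) -> None:
        self.total_tokens = 0

    def record(self, iteration_stats: ToyIterationStats) -> None:
        self.total_tokens += iteration_stats.num_generation_tokens

    def log(self) -> str:
        return f"generated tokens so far: {self.total_tokens}"


class ToyStatLoggerManager:
    """Fans each call out to every registered logger."""

    def __init__(self, loggers: list[ToyStatLoggerBase]) -> None:
        self.loggers = list(loggers)

    def record(self, iteration_stats: ToyIterationStats) -> None:
        for stat_logger in self.loggers:
            stat_logger.record(iteration_stats)

    def log(self) -> list[str]:
        return [stat_logger.log() for stat_logger in self.loggers]
```

Separating `record()` from `log()` lets stats arrive on every step while output is emitted on a slower schedule.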
LoggingStatLogger writes human-readable summaries to the standard info log. It tracks throughput over a local logging interval and resets its counters after each log() call. It also handles logging through SpecDecodingLogging, KVConnectorLogging, and CUDAGraphLogging.
Sources: vllm/v1/metrics/loggers.py99-134 vllm/v1/metrics/loggers.py161-190
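The interval-based throughput tracking can be sketched like this. `IntervalThroughput` is a hypothetical helper, and timestamps are passed in explicitly (rather than read from a clock) so the arithmetic is easy to follow.

```python
class IntervalThroughput:
    """Hypothetical tokens-per-second tracker that resets on every log()."""

    def __init__(self, now: float) -> None:
        self.last_log_time = now
        self.num_generation_tokens = 0

    def record(self, num_new_tokens: int) -> None:
        # Called once per engine step with that step's token count.
        self.num_generation_tokens += num_new_tokens

    def log(self, now: float) -> float:
        """Return tokens/s since the previous log() and start a new interval."""
        elapsed = now - self.last_log_time
        throughput = self.num_generation_tokens / elapsed if elapsed > 0 else 0.0
        # Reset so the next interval starts from zero.
        self.num_generation_tokens = 0
        self.last_log_time = now
        return throughput


tracker = IntervalThroughput(now=0.0)
tracker.record(100)
tracker.record(50)
first_interval = tracker.log(now=10.0)   # 150 tokens over 10 seconds
second_interval = tracker.log(now=20.0)  # nothing recorded in this interval
```

Resetting on each log() means every reported figure describes only the most recent interval, not a lifetime average.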
PrometheusStatLogger registers and updates Prometheus metrics, using prometheus_client to expose Gauges, Counters, and Histograms.
- MFU metrics: PerfMetricsProm is enabled if enable_mfu_metrics is set in ObservabilityConfig. vllm/v1/metrics/loggers.py129-131 vllm/v1/metrics/perf.py193-202

Sources: vllm/v1/metrics/loggers.py228-234 vllm/v1/metrics/perf.py193-202 docs/design/metrics.md83-86
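To give a feel for what a Prometheus scrape of such metrics returns, the sketch below renders a single sample in the Prometheus text exposition format using only the standard library. This is not how the logger works internally (it relies on prometheus_client); `render_metric` is a hypothetical helper, the metric name and label are used purely as illustrative inputs, and label-value escaping is omitted for brevity.

```python
def render_metric(
    name: str,
    metric_type: str,
    help_text: str,
    value: float,
    labels: dict[str, str],
) -> str:
    """Render one sample as HELP/TYPE comment lines plus the sample line."""
    # Labels are sorted so the output is deterministic.
    label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
        f"{name}{{{label_str}}} {value}",
    ]
    return "\n".join(lines)


text = render_metric(
    name="vllm:num_requests_running",  # illustrative metric name
    metric_type="gauge",
    help_text="Running requests.",
    value=3,
    labels={"model_name": "demo-model"},  # illustrative label
)
```

Gauges report a current value, counters only increase, and histograms expose bucketed observations; a real client library handles those distinctions along with escaping and registration.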
The system tracks several latency intervals for each request. These are stored in RequestStateStats and summarized in FinishedRequestStats when a request completes.
| Interval | Description | Code Entity |
|---|---|---|
| queue_time | Time spent in the waiting queue. | FinishedRequestStats.queue_time |
| prefill_time | Time from scheduling to first token generation. | FinishedRequestStats.prefill_time |
| decode_time | Time spent generating subsequent tokens. | FinishedRequestStats.decode_time |
| e2e_latency | Total time from arrival to completion. | FinishedRequestStats.e2e_latency |
Sources: vllm/v1/metrics/stats.py202-218 vllm/v1/metrics/stats.py410-456
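One way to picture how the four intervals relate is to derive them from a handful of per-request timestamps. The sketch below does that under stated assumptions: the timestamp names in `ToyRequestTimestamps` are hypothetical, and completion is taken to coincide with the last generated token.

```python
from dataclasses import dataclass


@dataclass
class ToyRequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on one clock."""

    arrival: float
    scheduled: float
    first_token: float
    last_token: float


@dataclass
class ToyFinishedRequestStats:
    """Hypothetical summary holding the four intervals from the table."""

    queue_time: float
    prefill_time: float
    decode_time: float
    e2e_latency: float


def summarize(ts: ToyRequestTimestamps) -> ToyFinishedRequestStats:
    return ToyFinishedRequestStats(
        queue_time=ts.scheduled - ts.arrival,        # waiting to be scheduled
        prefill_time=ts.first_token - ts.scheduled,  # scheduling to first token
        decode_time=ts.last_token - ts.first_token,  # subsequent tokens
        e2e_latency=ts.last_token - ts.arrival,      # arrival to completion
    )


summary = summarize(
    ToyRequestTimestamps(arrival=0.0, scheduled=0.5, first_token=1.5, last_token=4.0)
)
```

Under these assumptions the first three intervals add up to the end-to-end latency; in a real system the timestamps may come from different processes or clocks, so the sum need not match exactly.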
vLLM includes an analytic flops/memory estimation module. This is used to derive Model Flops Utilization (MFU) stats for a running model by parsing the model architecture and quantization configuration.
- Tracked quantities: num_flops_per_gpu, num_read_bytes_per_gpu, and num_write_bytes_per_gpu. vllm/v1/metrics/perf.py94-99
- Parsers (AttentionQuantizationConfigParser, FfnQuantizationConfigParser) are used to calculate expected flops and memory bandwidth usage based on VllmConfig. vllm/v1/metrics/perf.py203-220

Sources: vllm/v1/metrics/perf.py52-100 vllm/v1/metrics/perf.py203-220
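As a worked example of the final step, MFU is conventionally the ratio of achieved flops per second to the hardware's peak flops per second. The helper below is a sketch of that arithmetic only; `model_flops_utilization` is a hypothetical function, and the peak figure used in the example is an arbitrary placeholder rather than a claim about any particular GPU.

```python
def model_flops_utilization(
    num_flops_per_gpu: float,
    elapsed_seconds: float,
    peak_flops_per_second: float,
) -> float:
    """Achieved flops/s divided by peak flops/s, as a fraction in [0, 1]."""
    if elapsed_seconds <= 0 or peak_flops_per_second <= 0:
        raise ValueError("elapsed_seconds and peak_flops_per_second must be positive")
    achieved_flops_per_second = num_flops_per_gpu / elapsed_seconds
    return achieved_flops_per_second / peak_flops_per_second


# 5e14 estimated flops executed in 1 second on a device with a placeholder
# peak of 1e15 flops/s gives 50% utilization.
mfu = model_flops_utilization(
    num_flops_per_gpu=5e14,
    elapsed_seconds=1.0,
    peak_flops_per_second=1e15,
)
```

The same pattern applies to memory bandwidth: divide estimated bytes read and written per second by the device's peak bandwidth.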
When running in a Ray environment (e.g., Ray Serve), vLLM uses RayPrometheusStatLogger. This logger wraps Prometheus metrics using ray.util.metrics to ensure they are correctly associated with Ray replicas and workers.
- RayGaugeWrapper, RayCounterWrapper, and RayHistogramWrapper provide the same API as prometheus_client but report to Ray's internal telemetry. vllm/v1/metrics/ray_wrappers.py85-170
- Metrics are associated with the ReplicaId retrieved from the Ray Serve context. vllm/v1/metrics/ray_wrappers.py21-29 vllm/v1/metrics/ray_wrappers.py57

Sources: vllm/v1/metrics/ray_wrappers.py21-29 vllm/v1/metrics/ray_wrappers.py85-170 vllm/v1/metrics/ray_wrappers.py203-210
Speculative decoding performance is tracked via SpecDecodingStats. These stats are aggregated by SpecDecodingLogging (for stdout) and SpecDecodingProm (for Prometheus).
Key metrics include acceptance rates and acceptance lengths, which help tune the speculative decoding parameters. Acceptance statistics are collected in the scheduler and passed through SchedulerStats.
Sources: vllm/v1/spec_decode/metrics.py17-45 vllm/v1/metrics/loggers.py115 vllm/v1/metrics/loggers.py183 vllm/v1/metrics/stats.py190
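The two headline numbers can be computed from three counters, as sketched below. The counter names and the exact formulas are assumptions based on common definitions: acceptance rate as accepted draft tokens over proposed draft tokens, and mean acceptance length as one (the token the target model always contributes) plus accepted tokens per draft round.

```python
def acceptance_metrics(
    num_drafts: int,
    num_draft_tokens: int,
    num_accepted_tokens: int,
) -> tuple[float, float]:
    """Return (acceptance_rate, mean_acceptance_length) for one interval."""
    # Fraction of proposed draft tokens that the target model accepted.
    acceptance_rate = (
        num_accepted_tokens / num_draft_tokens if num_draft_tokens else 0.0
    )
    # Tokens emitted per draft round: accepted draft tokens plus the one token
    # that the target model produces itself in every round.
    mean_acceptance_length = (
        1 + num_accepted_tokens / num_drafts if num_drafts else 0.0
    )
    return acceptance_rate, mean_acceptance_length


# 10 draft rounds proposing 4 tokens each, of which 20 were accepted in total.
rate, mean_length = acceptance_metrics(
    num_drafts=10, num_draft_tokens=40, num_accepted_tokens=20
)
```

A low acceptance rate with long drafts suggests proposing fewer tokens per round; a consistently high rate suggests there may be room to propose more.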
The KVCacheEvictionEvent dataclass captures details about KV cache block evictions, including lifetime_seconds, idle_seconds, and reuse_gaps_seconds. These events are collected in SchedulerStats and can be used to analyze KV cache efficiency and eviction policies.
Sources: vllm/v1/metrics/stats.py162-169 vllm/v1/metrics/stats.py188
MultiModalCacheStats tracks hit statistics for multi-modal data caching. It records the number of queries and hits for multi-modal data items (e.g., image or audio features). CachingMetrics can observe these stats to provide a hit rate.
Sources: vllm/v1/metrics/stats.py146-159 vllm/v1/metrics/loggers.py111
Users can implement custom loggers by subclassing StatLoggerBase and registering them via the vllm.stat_loggers entry point group.
- load_stat_logger_plugin_factories() searches for plugins at runtime using load_plugins_by_group(STAT_LOGGER_PLUGINS_GROUP). vllm/v1/metrics/loggers.py74-88
- Each plugin must be a subclass of StatLoggerBase. vllm/v1/metrics/loggers.py78-84
- StatLoggerManager calls record() and log() on all registered plugins in addition to default loggers. vllm/v1/metrics/loggers.py1022-1050

Sources: vllm/v1/metrics/loggers.py74-88 vllm/v1/metrics/loggers.py1022-1050 vllm/v1/metrics/loggers.py19
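Entry point groups are a standard Python packaging mechanism, so the discovery step can be approximated with `importlib.metadata` alone. The sketch below lists the entry points registered under a group without loading them; `find_stat_logger_entry_points` is a hypothetical helper and does not reproduce the validation that the real loader performs.

```python
from importlib.metadata import entry_points


def find_stat_logger_entry_points(group: str = "vllm.stat_loggers") -> list:
    """Return the entry points registered under `group` (possibly empty)."""
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):
        # Newer Python versions expose a selectable collection.
        return list(all_entry_points.select(group=group))
    # Older versions return a plain dict keyed by group name.
    return list(all_entry_points.get(group, []))


for entry_point in find_stat_logger_entry_points():
    # entry_point.load() would import and return the registered object;
    # a real loader would also check that it subclasses the logger base class.
    print(entry_point.name, "->", entry_point.value)
```

A plugin package would typically advertise its logger by declaring an entry point in that group in its packaging metadata, after which it becomes discoverable once the package is installed.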