Metrics and Observability


This page documents how vLLM collects, aggregates, and exposes runtime metrics. It covers the v1 stats dataclasses, the logger hierarchy, Prometheus metrics, request latency breakdowns, and the integration with Ray and custom plugin loggers.

Architecture Overview

The metrics system has two layers: a stats collection layer that assembles data after each engine step, and a logger layer that consumes those stats and either writes them to standard output or exposes them through a Prometheus-compatible endpoint.

Metrics are collected in the engine core during execution and processed by the frontend. In AsyncLLM, the _run_output_handler() loop receives EngineCoreOutputs, which the OutputProcessor uses to update IterationStats. These stats are then passed to a StatLoggerManager instance.
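The hand-off described above can be sketched in a few lines. This is a minimal illustration, not vLLM's code: IterationStatsSketch, CountingLogger, LoggerManagerSketch, and handle_step are hypothetical stand-ins for IterationStats, a stat logger, StatLoggerManager, and one pass of the output handler loop.

```python
from dataclasses import dataclass


@dataclass
class IterationStatsSketch:
    """Simplified stand-in; the real IterationStats carries more fields."""
    num_generation_tokens: int = 0


class CountingLogger:
    """Hypothetical logger that accumulates what it is given."""

    def __init__(self) -> None:
        self.total_generation_tokens = 0

    def record(self, stats: IterationStatsSketch) -> None:
        self.total_generation_tokens += stats.num_generation_tokens


class LoggerManagerSketch:
    """Stand-in for StatLoggerManager: forwards each record() to every logger."""

    def __init__(self, loggers: list) -> None:
        self.loggers = list(loggers)

    def record(self, stats: IterationStatsSketch) -> None:
        for logger in self.loggers:
            logger.record(stats)


def handle_step(new_token_counts: list, manager: LoggerManagerSketch) -> IterationStatsSketch:
    """One pass of an output-handling loop: build stats, then hand them off."""
    stats = IterationStatsSketch()
    for count in new_token_counts:  # one entry per request output in this step
        stats.num_generation_tokens += count
    manager.record(stats)
    return stats


stdout_like = CountingLogger()
prometheus_like = CountingLogger()
manager = LoggerManagerSketch([stdout_like, prometheus_like])
handle_step([1, 1, 3], manager)
handle_step([2], manager)
print(stdout_like.total_generation_tokens)
```

Every logger sees the same per-step stats object, so adding another consumer does not change how stats are collected.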

Data flow through the metrics system

Sources: vllm/v1/metrics/loggers.py44-64 vllm/v1/metrics/stats.py171-200 vllm/v1/metrics/stats.py303-330 vllm/v1/metrics/loggers.py1022-1050 vllm/v1/engine/output_processor.py138-143

StatLoggerManager Initialization

StatLoggerManager coordinates all registered stat loggers. It is instantiated during engine initialization in LLMEngine or AsyncLLM when log_stats=True. It manages a list of StatLoggerBase implementations and handles the loading of plugins via load_stat_logger_plugin_factories.

Initialization flow

Sources: vllm/v1/metrics/loggers.py74-88 vllm/v1/metrics/loggers.py1014-1025 vllm/v1/engine/output_processor.py176-177

Stats Collection Layer

Stats objects are defined in vllm/v1/metrics/stats.py as dataclasses so that they can be passed easily between the engine core and the frontend.

SchedulerStats

SchedulerStats is produced by the scheduler after each step and carries the current state of resource allocation and cache usage.

Field	Type	Description
`num_running_reqs`	`int`	Requests currently in model execution batches.
`num_waiting_reqs`	`int`	Requests in the waiting queue.
`kv_cache_usage`	`float`	Fraction of KV cache blocks in use (0.0–1.0).
`prefix_cache_stats`	`PrefixCacheStats`	Local prefix cache hit/miss counts.
`spec_decoding_stats`	`SpecDecodingStats`	Statistics for speculative decoding performance.
`perf_stats`	`PerfStats`	MFU and hardware performance stats.
`kv_connector_stats`	`dict[str, Any]`	Stats for disaggregated KV transfer.
`cudagraph_stats`	`CUDAGraphStat`	CUDA Graph capture and usage statistics.
`waiting_lora_adapters`	`dict[str, int]`	Count of LoRA adapters in waiting requests.
`running_lora_adapters`	`dict[str, int]`	Count of LoRA adapters in running requests.

Sources: vllm/v1/metrics/stats.py171-200

IterationStats

IterationStats is assembled by the OutputProcessor from a batch of EngineCoreOutput objects. It tracks token counts and summaries of finished requests.

Field	Type	Description
`num_generation_tokens`	`int`	Total generation tokens in this step.
`prompt_token_stats`	`PromptTokenStats`	Breakdown of prompt tokens by source (compute, local cache, or external transfer).
`finished_requests`	`list[FinishedRequestStats]`	Summary for completed requests in this iteration.
`num_corrupted_reqs`	`int`	Number of requests corrupted (e.g., due to NaNs).
`num_preempted_reqs`	`int`	Number of requests preempted.

Sources: vllm/v1/metrics/stats.py303-330
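To make the per-step accumulation concrete, here is a rough sketch of how such a stats object might be filled from individual request outputs. The names PromptTokenBreakdown, update_from_output, and the three breakdown fields are assumptions for illustration; the actual PromptTokenStats fields are not listed on this page.

```python
from dataclasses import dataclass, field


@dataclass
class PromptTokenBreakdown:
    """Hypothetical stand-in for PromptTokenStats; field names are assumed."""
    computed: int = 0
    local_cache_hit: int = 0
    external_transfer: int = 0

    @property
    def total(self) -> int:
        return self.computed + self.local_cache_hit + self.external_transfer


@dataclass
class IterationStatsSketch:
    num_generation_tokens: int = 0
    prompt_tokens: PromptTokenBreakdown = field(default_factory=PromptTokenBreakdown)
    finished_request_ids: list = field(default_factory=list)

    def update_from_output(self, request_id: str, new_tokens: int, *,
                           prompt_len: int = 0, cached: int = 0,
                           transferred: int = 0, finished: bool = False) -> None:
        """Fold one request's output for this step into the iteration totals."""
        self.num_generation_tokens += new_tokens
        # Prompt tokens not served from a cache or a transfer had to be computed.
        self.prompt_tokens.computed += prompt_len - cached - transferred
        self.prompt_tokens.local_cache_hit += cached
        self.prompt_tokens.external_transfer += transferred
        if finished:
            self.finished_request_ids.append(request_id)


stats = IterationStatsSketch()
# A request finishing prefill: 100 prompt tokens, 64 served by the local cache.
stats.update_from_output("req-a", 1, prompt_len=100, cached=64)
# A request in decode that completes on this step.
stats.update_from_output("req-b", 1, finished=True)
print(stats.prompt_tokens.total, stats.finished_request_ids)
```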

CachingMetrics

Maintains a sliding window hit rate over the most recent $N$ requests (default 1000).

Implementation: Uses a deque of (requests, queries, hits) tuples. vllm/v1/metrics/stats.py51-52
Hit Rate: Calculated as aggregated_query_hit / aggregated_query_total. vllm/v1/metrics/stats.py107-111
Reset: Triggered by BaseCacheStats.reset (e.g., after a manual cache clear). vllm/v1/metrics/stats.py68-69

Sources: vllm/v1/metrics/stats.py35-111
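The sliding-window bookkeeping can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the code in stats.py: the class name SlidingHitRate and the observe() signature are hypothetical, and the eviction rule (drop the oldest entries while the window exceeds the request budget, but always keep at least one) is an assumption about the intended behavior.

```python
from collections import deque


class SlidingHitRate:
    """Hit rate over roughly the most recent `max_recent_requests` requests."""

    def __init__(self, max_recent_requests: int = 1000) -> None:
        self.max_recent_requests = max_recent_requests
        self.window = deque()  # entries are (requests, queries, hits)
        self.aggregated_requests = 0
        self.aggregated_query_total = 0
        self.aggregated_query_hit = 0

    def reset(self) -> None:
        self.window.clear()
        self.aggregated_requests = 0
        self.aggregated_query_total = 0
        self.aggregated_query_hit = 0

    def observe(self, requests: int, queries: int, hits: int, reset: bool = False) -> None:
        if reset:  # e.g. the cache was cleared, so old counts are meaningless
            self.reset()
        if requests == 0:
            return
        self.window.append((requests, queries, hits))
        self.aggregated_requests += requests
        self.aggregated_query_total += queries
        self.aggregated_query_hit += hits
        # Drop the oldest entries once the window exceeds its request budget.
        while len(self.window) > 1 and self.aggregated_requests > self.max_recent_requests:
            old_requests, old_queries, old_hits = self.window.popleft()
            self.aggregated_requests -= old_requests
            self.aggregated_query_total -= old_queries
            self.aggregated_query_hit -= old_hits

    @property
    def hit_rate(self) -> float:
        if self.aggregated_query_total == 0:
            return 0.0
        return self.aggregated_query_hit / self.aggregated_query_total


metrics = SlidingHitRate(max_recent_requests=3)
metrics.observe(requests=2, queries=10, hits=5)
rate_after_first = metrics.hit_rate
metrics.observe(requests=2, queries=10, hits=10)  # pushes the first entry out
rate_after_second = metrics.hit_rate
```

Keeping running totals alongside the deque means the hit rate is a constant-time division rather than a scan of the window.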

Logger Hierarchy

The system uses an abstract base class StatLoggerBase to define the interface for all loggers.

Sources: vllm/v1/metrics/loggers.py44-64 vllm/v1/metrics/loggers.py99-103 vllm/v1/metrics/loggers.py228-234

LoggingStatLogger

Writes human-readable summaries to the standard info log. It tracks throughput over a local logging interval and resets its counters after each log() call. It also drives the SpecDecodingLogging, KVConnectorLogging, and CUDAGraphLogging helpers, which log statistics for their own subsystems.

Sources: vllm/v1/metrics/loggers.py99-134 vllm/v1/metrics/loggers.py161-190
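The interval-and-reset behavior can be illustrated with a small sketch. IntervalThroughput is a hypothetical class, the log format is invented for the example, and timestamps are injectable only so that the example is deterministic.

```python
import time
from typing import Optional


class IntervalThroughput:
    """Accumulates token counts and reports tokens/s since the last log()."""

    def __init__(self, now: Optional[float] = None) -> None:
        self.last_log_time = time.monotonic() if now is None else now
        self._reset_counters()

    def _reset_counters(self) -> None:
        self.num_prompt_tokens = 0
        self.num_generation_tokens = 0

    def record(self, prompt_tokens: int, generation_tokens: int) -> None:
        self.num_prompt_tokens += prompt_tokens
        self.num_generation_tokens += generation_tokens

    def log(self, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_log_time
        prompt_tps = self.num_prompt_tokens / elapsed if elapsed > 0 else 0.0
        gen_tps = self.num_generation_tokens / elapsed if elapsed > 0 else 0.0
        # Start a fresh interval so the next log() reflects only new activity.
        self._reset_counters()
        self.last_log_time = now
        return f"prompt: {prompt_tps:.1f} tokens/s, generation: {gen_tps:.1f} tokens/s"


tracker = IntervalThroughput(now=0.0)
tracker.record(prompt_tokens=100, generation_tokens=50)
tracker.record(prompt_tokens=0, generation_tokens=50)
first_line = tracker.log(now=10.0)
second_line = tracker.log(now=20.0)  # nothing recorded in this interval
```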

PrometheusStatLogger

Registers and updates Prometheus metrics. It uses prometheus_client to expose Gauges, Counters, and Histograms.

Multiprocess Support: For deployments with multiple API servers, metrics are coordinated using the shared registry. docs/design/metrics.md83-86
MFU Metrics: Optionally records Model FLOPs Utilization (MFU) using PerfMetricsProm when enable_mfu_metrics is set in ObservabilityConfig. vllm/v1/metrics/loggers.py129-131 vllm/v1/metrics/perf.py193-202

Sources: vllm/v1/metrics/loggers.py228-234 vllm/v1/metrics/perf.py193-202 docs/design/metrics.md83-86
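Of the three metric types, Histograms are the least obvious: Prometheus histograms report cumulative bucket counts (each bucket counts observations less than or equal to its upper bound) together with a running sum and count. The sketch below reproduces that bookkeeping with the standard library only. HistogramSketch is a hypothetical class for illustration and is not part of prometheus_client or vLLM.

```python
import bisect


class HistogramSketch:
    """Stdlib illustration of Prometheus-style cumulative histogram buckets."""

    def __init__(self, buckets: list) -> None:
        # The final +Inf bucket guarantees that every observation lands somewhere.
        self.upper_bounds = sorted(buckets) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value (bounds are inclusive).
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> list:
        """Return (upper_bound, observations <= upper_bound) pairs."""
        running = 0
        result = []
        for bound, bucket_count in zip(self.upper_bounds, self.bucket_counts):
            running += bucket_count
            result.append((bound, running))
        return result


latency = HistogramSketch([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    latency.observe(seconds)
buckets = latency.cumulative()
```

Because buckets are cumulative, quantiles can be estimated on the server side from the bucket counts alone, which is why bucket boundaries matter when reading latency metrics.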

Request Latency Breakdown

The system tracks several latency intervals for each request. These are stored in RequestStateStats and summarized in FinishedRequestStats when a request completes.

Interval	Description	Code Entity
`queue_time`	Time spent in the waiting queue.	`FinishedRequestStats.queue_time`
`prefill_time`	Time from scheduling to first token generation.	`FinishedRequestStats.prefill_time`
`decode_time`	Time spent generating subsequent tokens.	`FinishedRequestStats.decode_time`
`e2e_latency`	Total time from arrival to completion.	`FinishedRequestStats.e2e_latency`

Sources: vllm/v1/metrics/stats.py202-218 vllm/v1/metrics/stats.py410-456
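One plausible way to derive these intervals from per-request timestamps is sketched below. The RequestTimestamps fields are assumptions made for the example (the actual RequestStateStats fields are not listed on this page), and the choice of which timestamps bound each interval follows the descriptions in the table above.

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    """Hypothetical per-request timestamps, in seconds on a single clock."""
    arrival: float
    queued: float
    scheduled: float
    first_token: float
    last_token: float
    finished: float


@dataclass
class LatencyBreakdown:
    queue_time: float
    prefill_time: float
    decode_time: float
    e2e_latency: float


def summarize(ts: RequestTimestamps) -> LatencyBreakdown:
    """Turn raw timestamps into the four intervals listed in the table."""
    return LatencyBreakdown(
        queue_time=ts.scheduled - ts.queued,          # waiting to be scheduled
        prefill_time=ts.first_token - ts.scheduled,   # scheduling to first token
        decode_time=ts.last_token - ts.first_token,   # subsequent tokens
        e2e_latency=ts.finished - ts.arrival,         # arrival to completion
    )


breakdown = summarize(RequestTimestamps(
    arrival=100.0, queued=100.5, scheduled=102.0,
    first_token=102.5, last_token=106.5, finished=107.0,
))
```

The three component intervals need not add up to e2e_latency: time before queueing and after the last token is included only in the end-to-end figure.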

Performance and MFU Metrics

vLLM includes an analytic FLOPs and memory estimation module. It derives Model FLOPs Utilization (MFU) stats for a running model by parsing the model architecture and quantization configuration.

PerfStats: Dataclass containing num_flops_per_gpu, num_read_bytes_per_gpu, and num_write_bytes_per_gpu. vllm/v1/metrics/perf.py94-99
ParserChain: Uses a chain of parsers (e.g., AttentionQuantizationConfigParser, FfnQuantizationConfigParser) to calculate expected FLOPs and memory bandwidth usage based on VllmConfig. vllm/v1/metrics/perf.py203-220
Quantization Impact: Weight byte sizes are adjusted based on the quantization method (e.g., 1 byte for FP8, 0.5 bytes for GPTQ/AWQ). vllm/v1/metrics/perf.py52-75

Sources: vllm/v1/metrics/perf.py52-100 vllm/v1/metrics/perf.py203-220
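The arithmetic behind these numbers is simple enough to sketch. The byte sizes for FP8 and GPTQ/AWQ come from the text above; the 2-byte entries for 16-bit formats, the function names, and the peak throughput figure are assumptions for illustration. MFU is taken here in its usual sense: achieved FLOPs per second divided by the hardware's peak FLOPs per second.

```python
# Bytes per weight element by quantization method.
BYTES_PER_WEIGHT = {
    "fp16": 2.0,
    "bf16": 2.0,
    "fp8": 1.0,
    "gptq": 0.5,  # 4-bit weights
    "awq": 0.5,   # 4-bit weights
}


def weight_bytes(num_params: int, quantization: str) -> float:
    """Estimated bytes read to stream all weights once."""
    return num_params * BYTES_PER_WEIGHT[quantization]


def model_flops_utilization(num_flops_per_gpu: float,
                            elapsed_seconds: float,
                            peak_flops_per_second: float) -> float:
    """Fraction of the GPU's peak compute that a step actually used."""
    achieved_flops_per_second = num_flops_per_gpu / elapsed_seconds
    return achieved_flops_per_second / peak_flops_per_second


awq_bytes = weight_bytes(7_000_000_000, "awq")
fp16_bytes = weight_bytes(7_000_000_000, "fp16")
# Hypothetical step: 5e13 FLOPs in one second on a GPU with a 1e14 FLOP/s peak.
mfu = model_flops_utilization(5e13, elapsed_seconds=1.0, peak_flops_per_second=1e14)
```

In this example the AWQ model reads a quarter of the weight bytes of the FP16 model, which matters mostly for memory-bound decode steps.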

Ray Integration

When running in a Ray environment (e.g., Ray Serve), vLLM uses RayPrometheusStatLogger. This logger wraps Prometheus metrics using ray.util.metrics so that they are correctly associated with Ray replicas and workers.

Ray Wrappers: RayGaugeWrapper, RayCounterWrapper, and RayHistogramWrapper provide the same API as prometheus_client but report to Ray's internal telemetry. vllm/v1/metrics/ray_wrappers.py85-170
Replica Identification: Metrics are automatically tagged with the ReplicaId retrieved from the Ray Serve context. vllm/v1/metrics/ray_wrappers.py21-29 vllm/v1/metrics/ray_wrappers.py57

Sources: vllm/v1/metrics/ray_wrappers.py21-29 vllm/v1/metrics/ray_wrappers.py85-170 vllm/v1/metrics/ray_wrappers.py203-210
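The wrapper idea is an adapter: callers keep using the prometheus_client style (labels(...) followed by set(...)), while the wrapper forwards values and tags to a different backend and adds a replica tag. The sketch below uses a fake in-memory backend so it runs without Ray. FakeTaggedGauge and GaugeWrapperSketch are hypothetical names, and the backend's set(value, tags=...) shape is an assumption for illustration rather than a reproduction of the wrappers in ray_wrappers.py.

```python
class FakeTaggedGauge:
    """In-memory stand-in for a tag-based gauge backend (not a Ray class)."""

    def __init__(self, name: str, tag_keys: tuple) -> None:
        self.name = name
        self.tag_keys = tag_keys
        self.samples = []

    def set(self, value: float, tags: dict) -> None:
        self.samples.append((value, dict(tags)))


class GaugeWrapperSketch:
    """Exposes labels(...).set(...) while reporting to a tag-based backend."""

    def __init__(self, name: str, labelnames: list, replica_id: str) -> None:
        self._labelnames = tuple(labelnames)
        self._backend = FakeTaggedGauge(name, self._labelnames + ("ReplicaId",))
        # Every sample carries the replica tag without callers having to pass it.
        self._tags = {"ReplicaId": replica_id}

    def labels(self, **labelkwargs) -> "GaugeWrapperSketch":
        unknown = set(labelkwargs) - set(self._labelnames)
        if unknown:
            raise ValueError(f"unknown labels: {sorted(unknown)}")
        self._tags.update({key: str(val) for key, val in labelkwargs.items()})
        return self

    def set(self, value: float) -> None:
        self._backend.set(value, tags=self._tags)


gauge = GaugeWrapperSketch("num_requests_running", ["model_name"], replica_id="replica-0")
gauge.labels(model_name="example-model").set(3)
recorded = gauge._backend.samples
```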

Speculative Decoding Metrics

Speculative decoding performance is tracked via SpecDecodingStats. These stats are aggregated by SpecDecodingLogging (for stdout) and SpecDecodingProm (for Prometheus).

Key metrics include acceptance rates and acceptance lengths, which help tune speculative decoding parameters. Acceptance statistics are collected in the scheduler and passed through SchedulerStats.

Sources: vllm/v1/spec_decode/metrics.py17-45 vllm/v1/metrics/loggers.py115 vllm/v1/metrics/loggers.py183 vllm/v1/metrics/stats.py190
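As a rough guide to reading these metrics, the two quantities are commonly defined as below. The function and parameter names are assumptions for illustration; in particular, the "1 +" term assumes that each speculative step also yields one token from the target model in addition to any accepted draft tokens.

```python
def acceptance_rate(num_accepted_tokens: int, num_draft_tokens: int) -> float:
    """Fraction of proposed draft tokens that the target model accepted."""
    if num_draft_tokens == 0:
        return 0.0
    return num_accepted_tokens / num_draft_tokens


def mean_acceptance_length(num_accepted_tokens: int, num_drafts: int) -> float:
    """Average tokens emitted per speculative step, counting the target's own token."""
    if num_drafts == 0:
        return 0.0
    return 1 + num_accepted_tokens / num_drafts


# Hypothetical interval: 10 drafts of 4 tokens each, 30 draft tokens accepted.
rate = acceptance_rate(num_accepted_tokens=30, num_draft_tokens=40)
length = mean_acceptance_length(num_accepted_tokens=30, num_drafts=10)
```

A low acceptance rate with long drafts suggests shortening the number of speculative tokens, since rejected draft tokens are wasted work.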

KV Cache Eviction Metrics

The KVCacheEvictionEvent dataclass captures details about KV cache block evictions, including lifetime_seconds, idle_seconds, and reuse_gaps_seconds. These events are collected in SchedulerStats and can be used to analyze KV cache efficiency and eviction policies.

Sources: vllm/v1/metrics/stats.py162-169 vllm/v1/metrics/stats.py188
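The three fields can be read as follows, assuming that lifetime runs from allocation to eviction, idle time from the last access to eviction, and reuse gaps are the intervals between successive accesses. Those semantics, the tuple type, and the names EvictionEventSketch and make_event are assumptions for this sketch, not a description of the actual dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvictionEventSketch:
    """Illustrative stand-in for KVCacheEvictionEvent; semantics are assumed."""
    lifetime_seconds: float
    idle_seconds: float
    reuse_gaps_seconds: tuple


def make_event(allocated_at: float, access_times: list, evicted_at: float) -> EvictionEventSketch:
    """Summarize one block's history at the moment it is evicted."""
    accesses = sorted(access_times)
    last_access = accesses[-1] if accesses else allocated_at
    gaps = tuple(later - earlier for earlier, later in zip(accesses, accesses[1:]))
    return EvictionEventSketch(
        lifetime_seconds=evicted_at - allocated_at,
        idle_seconds=evicted_at - last_access,
        reuse_gaps_seconds=gaps,
    )


event = make_event(allocated_at=0.0, access_times=[1.0, 3.0, 4.0], evicted_at=10.0)
never_reused = make_event(allocated_at=2.0, access_times=[], evicted_at=5.0)
```

Under these assumptions, a block with short reuse gaps but a long idle time before eviction was useful early and then went cold, which is the pattern an eviction policy is meant to exploit.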

MultiModalCacheStats tracks hit statistics for multi-modal data caching. It records the number of queries and hits for multi-modal data items (e.g., image or audio features). CachingMetrics can observe these stats to provide a hit rate.

Sources: vllm/v1/metrics/stats.py146-159 vllm/v1/metrics/loggers.py111

Plugin System for Custom Loggers

Users can implement custom loggers by subclassing StatLoggerBase and registering them via the vllm.stat_loggers entry point group.

Discovery: load_stat_logger_plugin_factories() searches for plugins at runtime using load_plugins_by_group(STAT_LOGGER_PLUGINS_GROUP). vllm/v1/metrics/loggers.py74-88
Validation: The system ensures all plugins are subclasses of StatLoggerBase. vllm/v1/metrics/loggers.py78-84
Execution: The StatLoggerManager calls record() and log() on all registered plugins in addition to default loggers. vllm/v1/metrics/loggers.py1022-1050

Sources: vllm/v1/metrics/loggers.py74-88 vllm/v1/metrics/loggers.py1022-1050 vllm/v1/metrics/loggers.py19
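The shape of a custom logger and the subclass check can be sketched without importing vLLM. StatLoggerBaseSketch is a simplified stand-in for StatLoggerBase (the real record() signature may take additional arguments), and TokenCountLogger and validate_plugin_factories are hypothetical names. In a real package, the logger class would be advertised under the vllm.stat_loggers entry point group in the package metadata.

```python
from abc import ABC, abstractmethod


class StatLoggerBaseSketch(ABC):
    """Simplified stand-in for the StatLoggerBase interface."""

    @abstractmethod
    def record(self, scheduler_stats, iteration_stats) -> None:
        """Consume the stats produced by one engine step."""

    def log(self) -> None:
        """Optional periodic flush; the default does nothing."""


class TokenCountLogger(StatLoggerBaseSketch):
    """Hypothetical plugin that totals generated tokens."""

    def __init__(self) -> None:
        self.total_generation_tokens = 0

    def record(self, scheduler_stats, iteration_stats) -> None:
        if iteration_stats is not None:
            self.total_generation_tokens += iteration_stats["num_generation_tokens"]


def validate_plugin_factories(factories: dict) -> list:
    """Reject anything that is not a subclass of the logger base class."""
    for name, factory in factories.items():
        if not (isinstance(factory, type) and issubclass(factory, StatLoggerBaseSketch)):
            raise TypeError(f"stat logger plugin {name!r} must subclass StatLoggerBaseSketch")
    return list(factories.values())


factories = validate_plugin_factories({"token_count": TokenCountLogger})
plugin = factories[0]()
# A plain dict stands in for IterationStats to keep the sketch self-contained.
plugin.record(scheduler_stats=None, iteration_stats={"num_generation_tokens": 5})
plugin.log()
```

Validating at load time means a misconfigured plugin fails during engine initialization rather than on the first record() call.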