Token Counting and Metrics

The token counting and metrics system calculates character counts, token counts, and git-related metrics for packed repositories. These metrics show users how much of an AI model's context window the packed output will consume and how resource-intensive the run was.

For overall packager orchestration, see Core Packager System. For worker pool infrastructure, see Worker-Based Processing Pipeline.

Overview

The metrics system provides:

Token counting via the gpt-tokenizer library for accurate LLM context window estimation src/core/metrics/TokenCounter.ts1-2
Character counting for all processed files src/core/metrics/calculateMetrics.ts28
Selective calculation: by default, token counts are computed only for the top files, which reduces overhead.
Git metrics for diffs and commit logs when enabled in the configuration src/core/metrics/calculateMetrics.ts30-31
Fast-path output calculation that avoids re-tokenizing massive files by calculating the "wrapper" tokens around pre-calculated file contents src/core/metrics/calculateMetrics.ts120-140
Persistent caching of token counts using content-addressed hashing to speed up repeated runs src/core/metrics/tokenCountCache.ts21-39
Warm-up heuristics that eagerly initialize workers, depending on whether a repository has been seen before, to hide BPE initialization latency src/core/metrics/calculateMetrics.ts49-78

The system uses worker-based parallelization through initTaskRunner and selective calculation to minimize performance impact on large repositories.

Sources: src/core/metrics/TokenCounter.ts47-95 src/core/metrics/calculateMetrics.ts1-32 src/core/metrics/tokenCountCache.ts9-39

Core Architecture

The calculateMetrics function orchestrates all metrics calculation through a single shared worker pool and parallel execution of specialized calculation functions.

Calculation Pipeline Flow

Execution steps:

Background Cache Loading: pack kicks off loadTokenCountCache() in the background to overlap with file searching src/core/packager.ts87
Worker pool initialization: createMetricsTaskRunner initializes the pool. It checks for a "seen marker" to decide between a cold start (full warm-up) or warm-likely start (minimal warm-up) src/core/metrics/calculateMetrics.ts79-98
Immediate parallel execution: Starts file metrics, git diff, and git log metrics immediately so they overlap with output generation src/core/metrics/calculateMetrics.ts169-178
Output-dependent execution: Awaits the generated output. It attempts a "Fast Path" by extracting the template wrapper src/core/metrics/calculateMetrics.ts187-193
Results aggregation: Combines data into a CalculateMetricsResult src/core/metrics/calculateMetrics.ts24-32
Cleanup: Destroys the worker pool and saves the token cache to disk src/core/metrics/calculateMetrics.ts241-243 src/core/packager.ts223
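
The ordering of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the Repomix implementation: `countTokens`, `sketchCalculateMetrics`, and the result shape are hypothetical names, the tokenizer is a whitespace splitter rather than a BPE encoder, and worker pool setup, caching, and cleanup are omitted.

```typescript
// Hypothetical stand-in for a worker-backed tokenizer call (not a real BPE encoder).
const countTokens = async (text: string): Promise<number> =>
  text.split(/\s+/).filter(Boolean).length;

// Hypothetical result shape, loosely modeled on the metrics described above.
interface SketchMetricsResult {
  totalCharacters: number;
  totalTokens: number;
  fileTokenCounts: Record<string, number>;
  gitDiffTokenCount: number;
  gitLogTokenCount: number;
}

async function sketchCalculateMetrics(
  files: { path: string; content: string }[],
  outputPromise: Promise<string>,
  gitDiff?: string,
  gitLog?: string,
): Promise<SketchMetricsResult> {
  // Output-independent work starts immediately so it overlaps with output generation.
  const fileMetricsPromise = Promise.all(
    files.map(async (f) => [f.path, await countTokens(f.content)] as const),
  );
  const gitDiffPromise = gitDiff ? countTokens(gitDiff) : Promise.resolve(0);
  const gitLogPromise = gitLog ? countTokens(gitLog) : Promise.resolve(0);

  // Only the output-dependent step has to wait for the generated output.
  const output = await outputPromise;
  const [fileEntries, gitDiffTokenCount, gitLogTokenCount, totalTokens] = await Promise.all([
    fileMetricsPromise,
    gitDiffPromise,
    gitLogPromise,
    countTokens(output),
  ]);

  // Aggregate the per-file results into a single record.
  const fileTokenCounts: Record<string, number> = {};
  for (const [path, tokens] of fileEntries) {
    fileTokenCounts[path] = tokens;
  }

  return {
    totalCharacters: output.length,
    totalTokens,
    fileTokenCounts,
    gitDiffTokenCount,
    gitLogTokenCount,
  };
}
```

Because the file and git promises are created before `outputPromise` is awaited, their work proceeds while the output is still being generated, which is the overlap the steps describe.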

Sources: src/core/metrics/calculateMetrics.ts39-205 src/core/packager.ts81-128 src/core/metrics/tokenCountCache.ts101-158

Token Counting Optimization

Fast-Path Output Calculation

Instead of re-tokenizing a massive output file, the system reuses the token counts already calculated for individual files. The extractOutputWrapper function src/core/metrics/calculateMetrics.ts120-140 identifies the text that sits between file contents, such as XML tags or Markdown headers. Only this "wrapper" is tokenized, and its count is added to the sum of the per-file token counts. The optimization is gated by canUseFastOutputTokenPath and is currently available for the non-parsable XML, Markdown, and Plain styles src/core/metrics/calculateMetrics.ts142-147.
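
The idea can be illustrated with a small sketch. It assumes that each file's content appears verbatim and in order in the output; `extractWrapperSketch`, `fastPathTokenCount`, and the whitespace-based `countTokens` are hypothetical stand-ins and do not reproduce the real extractOutputWrapper logic.

```typescript
// Hypothetical stand-in tokenizer: one token per whitespace-separated word.
const countTokens = (text: string): number => text.split(/\s+/).filter(Boolean).length;

// Cuts each file's content out of the output and returns what is left (the wrapper).
// Returns null if any content is not found verbatim, so the caller can fall back
// to tokenizing the full output.
function extractWrapperSketch(output: string, fileContents: string[]): string | null {
  const parts: string[] = [];
  let cursor = 0;
  for (const content of fileContents) {
    const index = output.indexOf(content, cursor);
    if (index === -1) return null;
    parts.push(output.slice(cursor, index));
    cursor = index + content.length;
  }
  parts.push(output.slice(cursor));
  return parts.join('');
}

// Total = tokens in the wrapper + sum of the already-known per-file token counts.
function fastPathTokenCount(
  output: string,
  files: { content: string; tokens: number }[],
): number | null {
  const wrapper = extractWrapperSketch(output, files.map((f) => f.content));
  if (wrapper === null) return null;
  const fileTokens = files.reduce((sum, f) => sum + f.tokens, 0);
  return countTokens(wrapper) + fileTokens;
}
```

A `null` result models the case where the fast path cannot be used, for example when the output format escapes or transforms file content.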

Persistent Token Cache

Repomix implements a content-addressed disk cache for token counts src/core/metrics/tokenCountCache.ts1-39

Key Generation: Keys are built from ${encoding}:${byteLength}:${md5_16} src/core/metrics/tokenCountCache.ts37-38
Persistence: The cache is saved to the OS temporary directory with atomic renames to prevent corruption src/core/metrics/tokenCountCache.ts68-72
Capacity: Limited to 100,000 entries with FIFO eviction src/core/metrics/tokenCountCache.ts21
Seen Markers: A per-repo "seen" marker (MD5 of absolute root paths) is used to predict if a run will hit the cache, allowing the system to skip expensive worker pre-warming src/core/metrics/tokenCountCache.ts101-116
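
A minimal sketch of the key format and FIFO eviction described above. It assumes that `md5_16` means the first 16 hex characters of the MD5 digest; `cacheKeySketch` and `FifoTokenCacheSketch` are hypothetical names, and disk persistence and seen markers are left out.

```typescript
import { createHash } from 'node:crypto';

// Assumption: "md5_16" is the first 16 hex characters of the content's MD5 digest.
function cacheKeySketch(encoding: string, content: string): string {
  const byteLength = Buffer.byteLength(content, 'utf8');
  const md5_16 = createHash('md5').update(content, 'utf8').digest('hex').slice(0, 16);
  return `${encoding}:${byteLength}:${md5_16}`;
}

// In-memory cache with first-in, first-out eviction (hypothetical, no persistence).
class FifoTokenCacheSketch {
  private readonly entries = new Map<string, number>();

  constructor(private readonly maxEntries: number) {}

  get(key: string): number | undefined {
    return this.entries.get(key);
  }

  set(key: string, tokens: number): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, tokens);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because the key depends only on the encoding and the content, a file that is unchanged between runs maps to the same entry regardless of its path.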

Sources: src/core/metrics/calculateMetrics.ts73-100 src/core/metrics/tokenCountCache.ts1-188

gpt-tokenizer Integration

Token counting uses the gpt-tokenizer library via the TokenCounter class src/core/metrics/TokenCounter.ts47

TokenCounter Implementation

The TokenCounter class uses resolveEncodingAsync to lazily load BPE rank data src/core/metrics/TokenCounter.ts35-37. It treats all input as plain content by passing PLAIN_TEXT_OPTIONS, which disallows no special tokens, so sequences such as <|endoftext|> are tokenized as ordinary text src/core/metrics/TokenCounter.ts16-18.

Supported Encodings

The system supports several OpenAI encodings src/core/metrics/tokenEncodings.ts4-10:

o200k_base (GPT-4o, default)
cl100k_base (GPT-4)
p50k_base, p50k_edit, r50k_base
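
As an illustration, a configured encoding name can be checked against this list and the encoder loaded lazily, once per encoding. The sketch below is hypothetical throughout: `loadEncoderSketch` stands in for loading BPE rank data and returns a whitespace-based counter, and none of the names come from the Repomix source or from gpt-tokenizer.

```typescript
// Encoding names as listed above; the constant and type names are hypothetical.
const TOKEN_ENCODINGS_SKETCH = [
  'o200k_base',
  'cl100k_base',
  'p50k_base',
  'p50k_edit',
  'r50k_base',
] as const;
type TokenEncodingSketch = (typeof TOKEN_ENCODINGS_SKETCH)[number];

function isTokenEncoding(value: string): value is TokenEncodingSketch {
  return (TOKEN_ENCODINGS_SKETCH as readonly string[]).includes(value);
}

type CountFn = (text: string) => number;

// Records which encodings were actually loaded, to make the laziness observable.
const loadedEncodings: TokenEncodingSketch[] = [];

// Hypothetical loader: a real one would load BPE rank data for the encoding.
async function loadEncoderSketch(encoding: TokenEncodingSketch): Promise<CountFn> {
  loadedEncodings.push(encoding);
  return (text) => text.split(/\s+/).filter(Boolean).length;
}

// Memoize the loading promise so concurrent callers share a single load.
const encoderPromises = new Map<TokenEncodingSketch, Promise<CountFn>>();

function getEncoderSketch(encoding: string): Promise<CountFn> {
  if (!isTokenEncoding(encoding)) {
    return Promise.reject(new Error(`Unsupported encoding: ${encoding}`));
  }
  let pending = encoderPromises.get(encoding);
  if (!pending) {
    pending = loadEncoderSketch(encoding);
    encoderPromises.set(encoding, pending);
  }
  return pending;
}
```

Caching the promise rather than the resolved encoder means two requests that arrive before loading finishes still trigger only one load.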

Sources: src/core/metrics/TokenCounter.ts15-74 src/core/metrics/tokenEncodings.ts1-12 src/core/metrics/calculateMetrics.ts40-44

Git and Output Metrics

Git Diff and Log Metrics

Git Diffs: Calculated if includeDiffs is true. The calculateGitDiffMetrics function is called within the pipeline src/core/metrics/calculateMetrics.ts177
Git Logs: Calculated if includeLogs is true. The calculateGitLogMetrics function src/core/metrics/calculateMetrics.ts178 handles tokenization of the commit history.

Output Metrics (Full Path)

When the "Fast Path" is unavailable (e.g., for JSON output or split files), calculateOutputMetrics is used src/core/metrics/calculateMetrics.ts203-215. This fallback preserves accuracy for parsable formats, where the template engine may escape or transform file content.

Sources: src/core/metrics/calculateMetrics.ts169-215 src/core/metrics/calculateGitDiffMetrics.ts1-10 src/core/metrics/calculateGitLogMetrics.ts1-11

File Manipulation and Metrics

Before metrics are calculated, files may undergo manipulation (comment removal, empty line removal), which affects the final token count. The getFileManipulator function src/core/file/fileManipulate.ts103-106 selects an appropriate manipulator based on the file extension.

Manipulator	Action	Supported Extensions
`StripCommentsManipulator`	Removes comments via `@repomix/strip-comments` src/core/file/fileManipulate.ts37-41	`.c`, `.h`, `.hpp`, `.cpp`, `.cc`, `.cxx`, `.cs`, `.css`, `.dart`, `.go`, `.html`, `.java`, `.js`, `.jsx`, `.kt`, `.less`, `.php`, `.py`, `.rb`, `.rs`, `.sass`, `.scss`, `.sh`, `.sol`, `.sql`, `.swift`, `.ts`, `.tsx`, `.xml`, `.yaml`, `.yml` src/core/file/fileManipulate.ts58-90
`CompositeManipulator`	Applies multiple language-specific strips src/core/file/fileManipulate.ts45-56	`.vue`, `.svelte` src/core/file/fileManipulate.ts91-100
`BaseManipulator`	Default (removes empty lines only) src/core/file/fileManipulate.ts15-26	Unknown extensions
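
Extension-based selection along the lines of the table can be sketched as follows. The class and function names are hypothetical, the comment stripper is a deliberately naive stand-in for `@repomix/strip-comments`, and the fallback for unknown extensions follows the table rather than the actual return type of getFileManipulator.

```typescript
import { extname } from 'node:path';

interface FileManipulatorSketch {
  removeComments(content: string): string;
  removeEmptyLines(content: string): string;
}

// Shared helper: drops lines that are empty or contain only whitespace.
const dropEmptyLines = (content: string): string =>
  content
    .split('\n')
    .filter((line) => line.trim() !== '')
    .join('\n');

// Default behavior: leaves comments alone and removes empty lines only.
class BaseManipulatorSketch implements FileManipulatorSketch {
  removeComments(content: string): string {
    return content;
  }
  removeEmptyLines(content: string): string {
    return dropEmptyLines(content);
  }
}

// Naive illustration: strips `//` line comments without understanding strings or
// block comments. A real stripper must parse the language properly.
class LineCommentManipulatorSketch implements FileManipulatorSketch {
  removeComments(content: string): string {
    return content
      .split('\n')
      .map((line) => line.replace(/\/\/.*$/, ''))
      .join('\n');
  }
  removeEmptyLines(content: string): string {
    return dropEmptyLines(content);
  }
}

// Hypothetical registry covering two extensions only.
const manipulatorsByExtension: Record<string, FileManipulatorSketch> = {
  '.ts': new LineCommentManipulatorSketch(),
  '.js': new LineCommentManipulatorSketch(),
};

function getManipulatorSketch(filePath: string): FileManipulatorSketch {
  const ext = extname(filePath).toLowerCase();
  return manipulatorsByExtension[ext] ?? new BaseManipulatorSketch();
}
```

Since manipulation runs before counting, the same file can produce different token counts depending on which options are enabled.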

Sources: src/core/file/fileManipulate.ts28-106 src/core/file/fileManipulate.ts15-26

Metrics Reporting

After calculation, metrics are reported to the user via specialized reporters.

Token Count Tree: If output.tokenCountTree is enabled, reportTokenCountTree displays a hierarchical view of the repository with token counts per file and directory tests/cli/reporters/tokenCountTreeReporter.test.ts27-53
Threshold Filtering: The tree reporter can filter entries based on a minimum token count threshold tests/cli/reporters/tokenCountTreeReporter.test.ts70-83
Memory Logging: The pipeline uses logMemoryUsage and withMemoryLogging to track resource consumption during metrics calculation src/core/packager.ts81-96
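
To illustrate the tree and threshold behavior, the sketch below aggregates per-file token counts into directory totals and hides entries under a minimum. `buildTokenTreeSketch`, `renderTreeSketch`, and the output format are hypothetical and do not reproduce reportTokenCountTree.

```typescript
interface TreeNodeSketch {
  name: string;
  tokens: number;
  children: TreeNodeSketch[];
}

// Builds a directory tree in which every node carries the sum of its descendants' tokens.
function buildTokenTreeSketch(fileTokens: Record<string, number>): TreeNodeSketch {
  const root: TreeNodeSketch = { name: '.', tokens: 0, children: [] };
  for (const [path, tokens] of Object.entries(fileTokens)) {
    let node = root;
    node.tokens += tokens;
    for (const part of path.split('/')) {
      let child = node.children.find((c) => c.name === part);
      if (!child) {
        child = { name: part, tokens: 0, children: [] };
        node.children.push(child);
      }
      child.tokens += tokens;
      node = child;
    }
  }
  return root;
}

// Renders one indented line per node, skipping any node below the threshold.
function renderTreeSketch(node: TreeNodeSketch, minTokens = 0, depth = 0): string[] {
  if (node.tokens < minTokens) return [];
  const lines = [`${'  '.repeat(depth)}${node.name} (${node.tokens} tokens)`];
  for (const child of node.children) {
    lines.push(...renderTreeSketch(child, minTokens, depth + 1));
  }
  return lines;
}
```

Skipping a directory below the threshold also hides its children, which is safe here because a child can never hold more tokens than its parent.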

Sources: tests/cli/reporters/tokenCountTreeReporter.test.ts27-121 src/core/packager.ts23-37 src/core/metrics/calculateMetrics.ts160