The token counting and metrics system calculates character counts, token counts, and git-related metrics for packed repositories. This provides users with insights into AI model context window usage and resource consumption.
For overall packager orchestration, see Core Packager System. For worker pool infrastructure, see Worker-Based Processing Pipeline.
The metrics system provides:
- Token counting via the gpt-tokenizer library for accurate LLM context window estimation src/core/metrics/TokenCounter.ts1-2
The system uses worker-based parallelization through initTaskRunner and selective calculation to minimize performance impact on large repositories.
Sources: src/core/metrics/TokenCounter.ts47-95 src/core/metrics/calculateMetrics.ts1-32 src/core/metrics/tokenCountCache.ts9-39
The calculateMetrics function orchestrates all metrics calculation through a single shared worker pool and parallel execution of specialized calculation functions.
Execution steps:
- pack kicks off loadTokenCountCache() in the background to overlap with file searching src/core/packager.ts87
- createMetricsTaskRunner initializes the pool. It checks for a "seen marker" to decide between a cold start (full warm-up) and a warm-likely start (minimal warm-up) src/core/metrics/calculateMetrics.ts79-98
- The results are returned as a CalculateMetricsResult src/core/metrics/calculateMetrics.ts24-32
Sources: src/core/metrics/calculateMetrics.ts39-205 src/core/packager.ts81-128 src/core/metrics/tokenCountCache.ts101-158
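The overlap between the background cache load and file searching can be sketched as follows. This is a minimal illustration, not the code in packager.ts: packSketch, PackDepsSketch, and the injected loadCache and searchFiles functions are hypothetical names introduced here.

```typescript
// Hypothetical dependency bundle; the real pack function has a different shape.
interface PackDepsSketch {
  loadCache: () => Promise<Map<string, number>>;
  searchFiles: () => Promise<string[]>;
}

async function packSketch(
  deps: PackDepsSketch,
): Promise<{ files: string[]; cacheSize: number }> {
  // Start the cache load but do not await it yet.
  const cachePromise = deps.loadCache();
  // File searching proceeds while the cache load is in flight.
  const files = await deps.searchFiles();
  // Join the background work only when its result is needed.
  const cache = await cachePromise;
  return { files, cacheSize: cache.size };
}
```

Because the cache promise is created before the first await, both operations are started before either one has to finish.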
Instead of re-tokenizing a massive output file, the system reuses pre-calculated token counts for individual files. The extractOutputWrapper function src/core/metrics/calculateMetrics.ts120-140 identifies the string content that sits between files (XML tags, Markdown headers). Only this "wrapper" is tokenized, and its count is added to the sum of the per-file token counts. This optimization is triggered via canUseFastOutputTokenPath and is currently available for non-parsable XML, Markdown, and Plain styles src/core/metrics/calculateMetrics.ts142-147.
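The idea of tokenizing only the wrapper can be sketched as below. This is an illustration under stated assumptions, not the actual extractOutputWrapper: the function names, the FileEntry shape, and the whitespace-based countTokensStub (a stand-in for real BPE tokenization) are all hypothetical.

```typescript
// Hypothetical stand-in tokenizer: counts whitespace-separated words.
// The real system tokenizes with gpt-tokenizer.
const countTokensStub = (text: string): number =>
  text.split(/\s+/).filter((word) => word.length > 0).length;

interface FileEntry {
  content: string;
  tokens: number; // pre-calculated per-file token count
}

// Returns the parts of `output` that lie outside the file contents, or null
// when a file's content cannot be found verbatim (for example, it was escaped).
function extractWrapperSketch(output: string, files: FileEntry[]): string | null {
  const parts: string[] = [];
  let cursor = 0;
  for (const file of files) {
    const index = output.indexOf(file.content, cursor);
    if (index === -1) return null;
    parts.push(output.slice(cursor, index));
    cursor = index + file.content.length;
  }
  parts.push(output.slice(cursor));
  return parts.join("");
}

// Total = sum of cached file counts + tokens of the wrapper only.
function estimateOutputTokens(output: string, files: FileEntry[]): number | null {
  const wrapper = extractWrapperSketch(output, files);
  if (wrapper === null) return null;
  const fileTokens = files.reduce((sum, file) => sum + file.tokens, 0);
  return fileTokens + countTokensStub(wrapper);
}
```

A null result in this sketch signals that the caller should fall back to tokenizing the full output.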
Repomix implements a content-addressed disk cache for token counts src/core/metrics/tokenCountCache.ts1-39.
- Cache key format: ${encoding}:${byteLength}:${md5_16} src/core/metrics/tokenCountCache.ts37-38
Sources: src/core/metrics/calculateMetrics.ts73-100 src/core/metrics/tokenCountCache.ts1-188
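A key of this shape could be computed as in the sketch below. It assumes that byteLength is the UTF-8 byte length of the content and that md5_16 is the first 16 hex characters of its MD5 digest; both readings are assumptions, and buildCacheKeySketch is a hypothetical name.

```typescript
import { createHash } from "node:crypto";

// Builds a content-addressed key: same encoding + same content => same key.
function buildCacheKeySketch(encoding: string, content: string): string {
  // Assumption: length is measured in UTF-8 bytes, not characters.
  const byteLength = Buffer.byteLength(content, "utf8");
  // Assumption: "md5_16" means the first 16 hex characters of the MD5 digest.
  const md5_16 = createHash("md5").update(content, "utf8").digest("hex").slice(0, 16);
  return `${encoding}:${byteLength}:${md5_16}`;
}
```

Including the encoding in the key keeps counts produced by different tokenizers apart, since the same text generally tokenizes to different lengths under different encodings.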
Token counting uses the gpt-tokenizer library via the TokenCounter class src/core/metrics/TokenCounter.ts47.
The TokenCounter class uses resolveEncodingAsync to lazily load BPE rank data src/core/metrics/TokenCounter.ts35-37. It treats all text as plain content by using PLAIN_TEXT_OPTIONS, which disallows nothing, so that special tokens like <|endoftext|> are tokenized as ordinary text src/core/metrics/TokenCounter.ts16-18.
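One common way to arrange lazy loading is to memoize a promise per encoding, so the rank data is fetched at most once and only when first requested. The sketch below shows that pattern in isolation; whether resolveEncodingAsync works this way internally is not confirmed by the source, and createLazyEncodingResolver and RankLoader are hypothetical names.

```typescript
// Hypothetical loader signature; T stands for whatever the rank data looks like.
type RankLoader<T> = (encoding: string) => Promise<T>;

function createLazyEncodingResolver<T>(
  load: RankLoader<T>,
): (encoding: string) => Promise<T> {
  // One promise per encoding; concurrent callers share the same load.
  const inFlight = new Map<string, Promise<T>>();
  return (encoding) => {
    let promise = inFlight.get(encoding);
    if (promise === undefined) {
      promise = load(encoding); // nothing is loaded until the first request
      inFlight.set(encoding, promise);
    }
    return promise;
  };
}
```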
The system supports several OpenAI encodings src/core/metrics/tokenEncodings.ts4-10:
- o200k_base (GPT-4o, default)
- cl100k_base (GPT-4)
- p50k_base, p50k_edit, r50k_base
Sources: src/core/metrics/TokenCounter.ts15-74 src/core/metrics/tokenEncodings.ts1-12 src/core/metrics/calculateMetrics.ts40-44
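A configuration value can be checked against this list with a small type guard. The sketch below uses the encoding names and the default given above; the identifiers SUPPORTED_ENCODINGS, isSupportedEncoding, and resolveEncodingName are hypothetical and are not taken from tokenEncodings.ts.

```typescript
const SUPPORTED_ENCODINGS = [
  "o200k_base",
  "cl100k_base",
  "p50k_base",
  "p50k_edit",
  "r50k_base",
] as const;

type SupportedEncoding = (typeof SUPPORTED_ENCODINGS)[number];

const DEFAULT_ENCODING: SupportedEncoding = "o200k_base";

// Narrows an arbitrary string to one of the supported encoding names.
function isSupportedEncoding(value: string): value is SupportedEncoding {
  return (SUPPORTED_ENCODINGS as readonly string[]).includes(value);
}

// Falls back to the default when no value is given; rejects unknown names.
function resolveEncodingName(value?: string): SupportedEncoding {
  if (value === undefined) return DEFAULT_ENCODING;
  if (!isSupportedEncoding(value)) {
    throw new Error(`Unsupported token encoding: ${value}`);
  }
  return value;
}
```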
- Git diff metrics are calculated when includeDiffs is true. The calculateGitDiffMetrics function is called within the pipeline src/core/metrics/calculateMetrics.ts177
- Git log metrics are calculated when includeLogs is true. The calculateGitLogMetrics function src/core/metrics/calculateMetrics.ts178 handles tokenization of the commit history.
When the "Fast Path" is unavailable (e.g., for JSON output or split files), calculateOutputMetrics is used src/core/metrics/calculateMetrics.ts203-215. This fallback ensures accuracy for parsable formats where file content might be escaped or transformed by the template engine.
Sources: src/core/metrics/calculateMetrics.ts169-215 src/core/metrics/calculateGitDiffMetrics.ts1-10 src/core/metrics/calculateGitLogMetrics.ts1-11
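The choice between the fast path and the full-output fallback can be sketched as a predicate. This is an illustration only: the OutputOptionsSketch shape and its field names are hypothetical, and the sketch assumes that a parsable style disqualifies every format, which is one reading of the conditions described above rather than the confirmed logic of canUseFastOutputTokenPath.

```typescript
type OutputStyle = "xml" | "markdown" | "plain" | "json";

// Hypothetical option shape for illustration.
interface OutputOptionsSketch {
  style: OutputStyle;
  parsableStyle: boolean; // content may be escaped or transformed when true
  splitOutput: boolean; // output is written as multiple files when true
}

function canUseFastPathSketch(options: OutputOptionsSketch): boolean {
  // Split output has no single wrapper to tokenize.
  if (options.splitOutput) return false;
  // Assumption: escaped content no longer matches the cached per-file counts.
  if (options.parsableStyle) return false;
  return (
    options.style === "xml" ||
    options.style === "markdown" ||
    options.style === "plain"
  );
}
```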
Before metrics are calculated, files may undergo manipulation (comment removal, empty line removal), which affects the final token count. The getFileManipulator function src/core/file/fileManipulate.ts103-106 selects an appropriate manipulator based on file extension.
| Manipulator | Action | Supported Extensions |
|---|---|---|
| StripCommentsManipulator | Removes comments via @repomix/strip-comments src/core/file/fileManipulate.ts37-41 | .c, .h, .hpp, .cpp, .cc, .cxx, .cs, .css, .dart, .go, .html, .java, .js, .jsx, .kt, .less, .php, .py, .rb, .rs, .sass, .scss, .sh, .sol, .sql, .swift, .ts, .tsx, .xml, .yaml, .yml src/core/file/fileManipulate.ts58-90 |
| CompositeManipulator | Applies multiple language-specific strips src/core/file/fileManipulate.ts45-56 | .vue, .svelte src/core/file/fileManipulate.ts91-100 |
| BaseManipulator | Default (removes empty lines only) src/core/file/fileManipulate.ts15-26 | Unknown extensions |
Sources: src/core/file/fileManipulate.ts28-106 src/core/file/fileManipulate.ts15-26
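Extension-based selection along the lines of the table above can be sketched as follows. The extension set is deliberately abbreviated, and selectManipulatorKind, extensionOf, and removeEmptyLinesSketch are hypothetical helpers rather than the actual getFileManipulator implementation; whether the real code normalizes case is not stated, so this sketch matches extensions exactly.

```typescript
type ManipulatorKind = "strip-comments" | "composite" | "base";

const COMPOSITE_EXTENSIONS = new Set([".vue", ".svelte"]);
// Abbreviated subset of the comment-stripping extensions listed in the table.
const STRIP_COMMENT_EXTENSIONS = new Set([".ts", ".tsx", ".js", ".py", ".go", ".rs"]);

// Returns the extension including the dot, or "" for dotfiles and extensionless names.
function extensionOf(filePath: string): string {
  const base = filePath.slice(filePath.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot);
}

function selectManipulatorKind(filePath: string): ManipulatorKind {
  const ext = extensionOf(filePath);
  if (COMPOSITE_EXTENSIONS.has(ext)) return "composite";
  if (STRIP_COMMENT_EXTENSIONS.has(ext)) return "strip-comments";
  return "base"; // unknown extensions get the default behavior
}

// Default behavior in this sketch: drop lines that are empty or whitespace-only.
function removeEmptyLinesSketch(content: string): string {
  return content
    .split("\n")
    .filter((line) => line.trim() !== "")
    .join("\n");
}
```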
After calculation, metrics are reported to the user via specialized reporters.
- When output.tokenCountTree is enabled, reportTokenCountTree displays a hierarchical view of the repository with token counts per file and directory tests/cli/reporters/tokenCountTreeReporter.test.ts27-53
- The packager uses logMemoryUsage and withMemoryLogging to track resource consumption during metrics calculation src/core/packager.ts81-96
Sources: tests/cli/reporters/tokenCountTreeReporter.test.ts27-121 src/core/packager.ts23-37 src/core/metrics/calculateMetrics.ts160
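A hierarchical view of this kind amounts to rolling per-file token counts up into their parent directories. The sketch below shows one way to do that; TreeNodeSketch, buildTokenTreeSketch, and renderTreeSketch are hypothetical names, and the output format is illustrative rather than the format produced by reportTokenCountTree.

```typescript
interface TreeNodeSketch {
  tokens: number; // total tokens of everything beneath this node
  children: Map<string, TreeNodeSketch>;
}

// Builds a tree from "dir/file" paths, adding each file's tokens to every ancestor.
function buildTokenTreeSketch(fileTokens: Record<string, number>): TreeNodeSketch {
  const root: TreeNodeSketch = { tokens: 0, children: new Map() };
  for (const filePath of Object.keys(fileTokens)) {
    const tokens = fileTokens[filePath];
    let node: TreeNodeSketch = root;
    node.tokens += tokens;
    for (const segment of filePath.split("/")) {
      let child = node.children.get(segment);
      if (child === undefined) {
        child = { tokens: 0, children: new Map() };
        node.children.set(segment, child);
      }
      child.tokens += tokens;
      node = child;
    }
  }
  return root;
}

// Renders one line per file or directory, indented by depth.
function renderTreeSketch(node: TreeNodeSketch, indent = ""): string[] {
  const lines: string[] = [];
  node.children.forEach((child, name) => {
    lines.push(`${indent}${name} (${child.tokens} tokens)`);
    lines.push(...renderTreeSketch(child, `${indent}  `));
  });
  return lines;
}
```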