AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows

Katta, Mukunda Rao

doi:10.5281/zenodo.20044318

Published May 5, 2026 | Version v1

Publication Open

Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing. The artifact bundle is connected to the ai-eval-forge package and the public paper repository at https://github.com/MukundaKatta/ai-eval-forge-paper.
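
As an illustration of the mixed-check idea described above, the following is a minimal Python sketch, not the ai-eval-forge implementation or its API. The case layout (`id`, `checks`), the check type names (`exact`, `contains`, `regex`, `token_f1`, `json_valid`, `json_field`), the helper names, and the pass rule (every check must reach a threshold) are all assumptions made for this example. Citation coverage, custom expressions, cost, and latency are left out to keep it short.

```python
import json
import re
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def run_check(check: dict, output: str) -> float:
    """Score one check against one output; returns a value in [0, 1].

    The check type names are hypothetical, chosen for this sketch.
    """
    kind = check["type"]
    if kind == "exact":
        return float(output.strip() == check["expected"])
    if kind == "contains":
        return float(check["expected"] in output)
    if kind == "regex":
        return float(re.search(check["pattern"], output) is not None)
    if kind == "token_f1":
        return token_f1(output, check["expected"])
    if kind in ("json_valid", "json_field"):
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        if kind == "json_valid":
            return 1.0
        return float(isinstance(data, dict)
                     and data.get(check["field"]) == check["expected"])
    raise ValueError(f"unknown check type: {kind}")


def evaluate(cases: list, outputs: dict, threshold: float = 1.0) -> dict:
    """Summarize pass rate and mean score over a list of cases.

    Assumed rule: a case passes only if every check scores >= threshold.
    Each case is assumed to carry at least one check.
    """
    if not cases:
        return {"pass_rate": 0.0, "mean_score": 0.0}
    passed = 0
    case_scores = []
    for case in cases:
        output = outputs.get(case["id"], "")
        scores = [run_check(check, output) for check in case["checks"]]
        case_scores.append(sum(scores) / len(scores))
        passed += all(score >= threshold for score in scores)
    return {
        "pass_rate": passed / len(cases),
        "mean_score": sum(case_scores) / len(case_scores),
    }


# Two hypothetical cases mixing structural and text checks.
CASES = [
    {"id": "refund", "checks": [
        {"type": "json_valid"},
        {"type": "json_field", "field": "status", "expected": "approved"},
    ]},
    {"id": "capital", "checks": [
        {"type": "contains", "expected": "Paris"},
        {"type": "token_f1", "expected": "Paris is the capital"},
    ]},
]
OUTPUTS = {
    "refund": '{"status": "approved"}',
    "capital": "Paris is the capital of France",
}

if __name__ == "__main__":
    print(evaluate(CASES, OUTPUTS))
```

Running the same cases against outputs from two model or prompt versions and comparing the two summaries is one simple way to surface a regression. A JSONL case file would carry one such case object per line.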

Files

ai-eval-forge-preprint-package.zip (23.3 kB), md5:0a683bf2469963f943f4836b36f3623b