Published May 5, 2026 | Version v1 | Publication | Open
AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows
Description
Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing.
The artifact bundle is connected to the ai-eval-forge package and the public paper repository at https://github.com/MukundaKatta/ai-eval-forge-paper.
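To make the mixed-check idea concrete, here is a minimal Python sketch of how one case carrying several check types could be scored. It is not the ai-eval-forge API: the case schema (`id`, `checks`, `type`, `expected`, `pattern`, `field`, `threshold`) and the function names `token_f1`, `run_check`, and `score_case` are assumptions made for illustration. Only a subset of the listed checks is covered; citation coverage and custom expressions are omitted.

```python
import json
import re
from collections import Counter


def token_f1(expected: str, actual: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens."""
    exp, act = expected.lower().split(), actual.lower().split()
    if not exp or not act:
        # Both empty counts as a match; one empty counts as a miss.
        return float(exp == act)
    overlap = sum((Counter(exp) & Counter(act)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(act), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def run_check(check: dict, output: str) -> float:
    """Score one check in [0, 1]. The check type names are hypothetical."""
    kind = check["type"]
    if kind == "exact":
        return float(output.strip() == check["expected"])
    if kind == "substring":
        return float(check["expected"] in output)
    if kind == "regex":
        return float(re.search(check["pattern"], output) is not None)
    if kind == "token_f1":
        return token_f1(check["expected"], output)
    if kind in ("json_valid", "json_field"):
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return 0.0
        if kind == "json_valid":
            return 1.0
        return float(
            isinstance(parsed, dict)
            and parsed.get(check["field"]) == check["expected"]
        )
    raise ValueError(f"unknown check type: {kind}")


def score_case(case: dict, output: str) -> dict:
    """Average the check scores; a case passes if every check meets the threshold."""
    scores = [run_check(check, output) for check in case["checks"]]
    threshold = case.get("threshold", 1.0)  # assumed per-case pass threshold
    mean = sum(scores) / len(scores) if scores else 0.0
    passed = bool(scores) and all(s >= threshold for s in scores)
    return {"id": case["id"], "score": mean, "passed": passed}


# One JSONL-style line describing a case with two checks (illustrative data).
case = json.loads(
    '{"id": "refund-1", "checks": ['
    '{"type": "json_valid"}, '
    '{"type": "json_field", "field": "status", "expected": "approved"}]}'
)
result = score_case(case, '{"status": "approved", "amount": 20}')
print(result)
```

Keeping every check as a function that returns a score in [0, 1] lets binary checks (exact match, JSON validity) and graded checks (token F1) share one aggregation step, which is one plausible way to report both a pass rate and a score per run.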
Files
| Name | Size | Checksum |
|---|---|---|
| ai-eval-forge-preprint-package.zip | 23.3 kB | md5:0a683bf2469963f943f4836b36f3623b |