Published May 5, 2026 | Version v1
Publication | Open

AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows

Authors/Creators

Description

Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing. The artifact bundle is connected to the ai-eval-forge package and the public paper repository at https://github.com/MukundaKatta/ai-eval-forge-paper.
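To make the idea of a mixed-check case concrete, the following is a minimal Python sketch of how several of the check types named above could be combined on one model output. It is not the ai-eval-forge API: the case layout, the check type names (`exact`, `substring`, `regex`, `token_f1`, `json_valid`, `json_field`), the per-check `min` threshold, and the functions `run_check`, `run_case`, and `summarize` are all assumptions made for illustration. Token-F1 is computed here over lowercased, whitespace-split tokens, which is one common convention and may differ from the package's own tokenization.

```python
import json
import re
from collections import Counter


def token_f1(expected: str, actual: str) -> float:
    """Token-level F1 over lowercased, whitespace-split tokens (assumed convention)."""
    exp, act = expected.lower().split(), actual.lower().split()
    if not exp or not act:
        return float(exp == act)
    overlap = sum((Counter(exp) & Counter(act)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(act), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def parse_json(output: str):
    """Return (is_valid_json, parsed_value_or_None)."""
    try:
        return True, json.loads(output)
    except json.JSONDecodeError:
        return False, None


def run_check(check: dict, output: str) -> float:
    """Score one check in [0, 1]. The type names are hypothetical."""
    kind = check["type"]
    if kind == "exact":
        return float(output.strip() == check["expected"])
    if kind == "substring":
        return float(check["expected"] in output)
    if kind == "regex":
        return float(re.search(check["pattern"], output) is not None)
    if kind == "token_f1":
        return token_f1(check["expected"], output)
    valid, data = parse_json(output)
    if kind == "json_valid":
        return float(valid)
    if kind == "json_field":
        ok = valid and isinstance(data, dict)
        return float(ok and data.get(check["field"]) == check["expected"])
    raise ValueError(f"unknown check type: {kind}")


def run_case(case: dict, output: str) -> dict:
    """A case passes only if every check meets its own 'min' (default 1.0)."""
    scores = [run_check(c, output) for c in case["checks"]]
    passed = all(s >= c.get("min", 1.0) for s, c in zip(scores, case["checks"]))
    return {"id": case["id"], "score": sum(scores) / len(scores), "passed": passed}


def summarize(results: list) -> dict:
    """Aggregate pass rate and mean score across case results."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "mean_score": sum(r["score"] for r in results) / n,
    }


# One case per JSONL line; the field names are illustrative only.
CASES_JSONL = """
{"id": "refund-json", "checks": [{"type": "json_valid"}, {"type": "json_field", "field": "status", "expected": "approved"}, {"type": "regex", "pattern": "\\"amount\\":\\\\s*\\\\d+"}]}
{"id": "summary-f1", "checks": [{"type": "token_f1", "expected": "the cat sat", "min": 0.8}]}
"""
cases = [json.loads(line) for line in CASES_JSONL.splitlines() if line.strip()]
outputs = {
    "refund-json": '{"status": "approved", "amount": 42}',
    "summary-f1": "the cat ran",
}
results = [run_case(c, outputs[c["id"]]) for c in cases]
report = summarize(results)
```

In this sketch the first case mixes a structural check, a field check, and a regex on the same output, while the second case fails because its token-F1 falls below the assumed 0.8 threshold. Cost and latency are omitted; they would be carried alongside each result and aggregated the same way as the score.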

Files

ai-eval-forge-preprint-package.zip

Size: 23.3 kB
md5:0a683bf2469963f943f4836b36f3623b