What question did this study set out to answer?

The aim is to introduce AI Eval Forge as a tool for efficient regression testing across large-model and agent workflows.

May 7, 2026 · Open Access

AI Eval Forge: Mixed-Check Regression Testing for LLM and Agent Workflows

Key Points

Presents a zero-dependency evaluation harness for regression testing.
Supports checks such as exact-match, substring, regex, and JSON validity.
Describes the harness design, check model, reporting format, and practical role of mixed-check cases.
Targets teams that need faster regression checks than broad benchmark suites can provide.
Summarizes pass rate, score, cost, and latency for each run.
Lets teams catch regressions without standing up a heavy evaluation stack.

Abstract

Large-model and agent teams often need faster regression checks than broad benchmark suites can provide. This paper presents AI Eval Forge, a zero-dependency evaluation harness for mixed-check regression testing across LLM and agent workflows. The tool supports exact-match, substring, regex, token-F1, JSON validity, JSON field equality, citation coverage, and bounded custom-expression checks in a compact case format that works with JSON or JSONL. The contribution is not a new benchmark. It is a small, inspectable evaluation layer that helps teams compare runs, catch regressions, and summarize pass rate, score, cost, and latency without standing up a heavy evaluation stack. The paper describes the harness design, check model, reporting format, and practical role of mixed-check cases in real workflow testing. The artifact bundle is connected to the ai-eval-forge package and the public paper repository at https://github.com/MukundaKatta/ai-eval-forge-paper.
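To make the idea of a mixed-check case concrete, the following is a minimal Python sketch of how several check types could be applied to one recorded output and rolled up into pass rate, score, cost, and latency. It is not the ai-eval-forge API or case schema: the field names (`type`, `expected`, `pattern`, `field`, `min`, `cost_usd`, `latency_ms`), the function names, and the pass rule (every check must reach its minimum score) are assumptions made for illustration. Citation coverage and custom-expression checks are omitted.

```python
import json
import re
from collections import Counter

_MISSING = object()  # sentinel for "field absent or output not a JSON object"


def token_f1(expected: str, actual: str) -> float:
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    exp, act = expected.lower().split(), actual.lower().split()
    if not exp or not act:
        return float(exp == act)
    overlap = sum((Counter(exp) & Counter(act)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(act), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def json_field(text: str, field: str):
    """Return the top-level field of a JSON object, or _MISSING."""
    try:
        data = json.loads(text)
    except ValueError:
        return _MISSING
    return data.get(field, _MISSING) if isinstance(data, dict) else _MISSING


def run_check(check: dict, output: str) -> float:
    """Score one check against one output, in the range [0, 1]."""
    kind = check["type"]  # hypothetical check names, not the tool's own
    if kind == "exact":
        return float(output.strip() == check["expected"])
    if kind == "substring":
        return float(check["expected"] in output)
    if kind == "regex":
        return float(re.search(check["pattern"], output) is not None)
    if kind == "token_f1":
        return token_f1(check["expected"], output)
    if kind == "json_valid":
        return float(json_field(output, "") is not _MISSING or _is_json(output))
    if kind == "json_field":
        return float(json_field(output, check["field"]) == check["expected"])
    raise ValueError(f"unknown check type: {kind}")


def _is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


def load_cases(text: str) -> list:
    """Parse JSONL: one case object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def evaluate(cases: list) -> list:
    """Run every check in every case; a case passes only if all checks do."""
    results = []
    for case in cases:
        scored = [(run_check(c, case["output"]), c.get("min", 1.0))
                  for c in case["checks"]]
        results.append({
            "id": case["id"],
            "score": sum(s for s, _ in scored) / len(scored),
            "passed": all(s >= minimum for s, minimum in scored),
            "cost_usd": case.get("cost_usd", 0.0),
            "latency_ms": case.get("latency_ms", 0.0),
        })
    return results


def summarize(results: list) -> dict:
    """Aggregate per-case results into run-level metrics."""
    n = len(results)
    return {
        "pass_rate": sum(r["passed"] for r in results) / n,
        "mean_score": sum(r["score"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "mean_latency_ms": sum(r["latency_ms"] for r in results) / n,
    }


# Illustrative cases in JSONL form; the outputs are made-up examples.
CASES_JSONL = r"""
{"id": "greet", "output": "Hello, world!", "cost_usd": 0.002, "latency_ms": 120, "checks": [{"type": "substring", "expected": "Hello"}, {"type": "regex", "pattern": "world!$"}]}
{"id": "tool-call", "output": "{\"tool\": \"search\", \"q\": \"cats\"}", "cost_usd": 0.004, "latency_ms": 300, "checks": [{"type": "json_valid"}, {"type": "json_field", "field": "tool", "expected": "search"}]}
{"id": "summary", "output": "the cat sat on a mat", "cost_usd": 0.003, "latency_ms": 210, "checks": [{"type": "token_f1", "expected": "the cat sat on the mat", "min": 0.8}, {"type": "exact", "expected": "the cat sat on the mat"}]}
"""

results = evaluate(load_cases(CASES_JSONL))
summary = summarize(results)
print(json.dumps(summary, indent=2))
```

In this sketch the `summary` case fails: its token-F1 check clears the 0.8 minimum, but the exact-match check on the same output does not. Mixing a tolerant check with a strict one in a single case is what lets a run distinguish "close enough" from "identical" without a separate benchmark.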
