ContextBench is an open-source benchmark for evaluating context engineering strategies in LLM-based enterprise workflow automation. The study compares prompt-only, retrieval-augmented generation, steering-document, memory-file, and combined context configurations on a synthetic software ticket-triage benchmark. The evaluation measures category accuracy, priority accuracy, customer-impact classification, evidence coverage, schema validity, and human-review behavior. Results from a 50-task OpenAI pilot show that context strategy changes not only classification accuracy but also evidence grounding and review burden. The paper argues that reliable LLM workflow automation requires systematic evaluation of context packaging, retrieval, memory, and operational steering rather than prompt design alone. The associated implementation, benchmark data, context documents, and evaluation scripts are publicly available at: https://github.com/chebrma99/ContextBench
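To make the evaluation dimensions concrete, the following is a minimal sketch of how a few of the listed metrics could be computed for one context configuration. It is not taken from the ContextBench repository. The `Ticket` record, the `REQUIRED_KEYS` schema, the `score` function, and the definition of evidence coverage as the fraction of gold evidence IDs cited by the model are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical gold-label record; the field names are illustrative assumptions.
@dataclass
class Ticket:
    category: str
    priority: str
    evidence_ids: set = field(default_factory=set)

# Assumed output schema: every model response must carry these keys.
REQUIRED_KEYS = {"category", "priority", "evidence_ids"}

def is_schema_valid(pred: dict) -> bool:
    """Return True if the prediction has all required keys with expected types."""
    return (
        REQUIRED_KEYS <= pred.keys()
        and isinstance(pred["category"], str)
        and isinstance(pred["priority"], str)
        and isinstance(pred["evidence_ids"], list)
    )

def score(golds: list, preds: list) -> dict:
    """Aggregate metrics over paired gold tickets and model predictions.

    Schema-invalid predictions count as wrong on every other metric.
    """
    n = len(golds)
    valid = cat_hits = pri_hits = 0
    coverage_sum = 0.0
    for gold, pred in zip(golds, preds):
        if not is_schema_valid(pred):
            continue  # contributes zero to every metric
        valid += 1
        cat_hits += pred["category"] == gold.category
        pri_hits += pred["priority"] == gold.priority
        if gold.evidence_ids:
            cited = set(pred["evidence_ids"]) & gold.evidence_ids
            coverage_sum += len(cited) / len(gold.evidence_ids)
    return {
        "schema_validity": valid / n,
        "category_accuracy": cat_hits / n,
        "priority_accuracy": pri_hits / n,
        "evidence_coverage": coverage_sum / n,
    }

# Two toy tasks: one partially correct prediction, one schema-invalid prediction.
golds = [
    Ticket("bug", "high", {"doc1", "doc2"}),
    Ticket("billing", "low", {"doc3"}),
]
preds = [
    {"category": "bug", "priority": "low", "evidence_ids": ["doc1"]},
    {"category": "billing"},  # missing keys, so schema-invalid
]
metrics = score(golds, preds)
```

Running `score` once per context configuration (prompt-only, retrieval-augmented, steering-document, memory-file, combined) over the same tasks would give a side-by-side comparison of the kind the abstract describes. Counting schema-invalid outputs as errors is one possible design choice; an evaluation could also report accuracy over valid outputs only.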
Manoja Chebrolu studied this question.