This paper identifies the core structural failings in prompt‑test design and evaluation for LLMs. It shows that the methods currently used to assess model behaviour cannot produce reliable signals: they mismeasure capability, misinterpret outputs, and often produce failure states that the tests themselves create. These practices emerged in an industry expanding faster than it can define standards, leaving evaluation shaped by inconsistent methods and by gatekeepers with little grounding in the systems they are judging. By examining how prompt‑tests are constructed, how their results are interpreted, and why these processes collapse under scrutiny, the paper demonstrates that prompt‑based testing is not a viable measure of model performance. The objective is to expose the mechanisms of failure and to define the conditions required for evaluation that reflects how models actually behave.
William Argo (Wed,) studied this question.