v3 — Substantive correction of the v2 (May 2026) release. CHANGE SUMMARY Inference outputs are unchanged; only the scoring of HumanEval was affected. A parsing oversight in evaluating raw Python outputs failed to strip stop tokens (e. g. , ``), causing functional execution (`SyntaxError`) failures for correctly generated code. This artificially suppressed the no-CoT scores, especially for Qwen-32B (where it dropped to 15. 9%). With proper token stripping (released as an updated results/scoreₕumaneval. py), the dramatic +68. 9 pp CoT boost for Qwen-32B is reduced to +23. 2 pp. The core finding of a model-size-dependent transition on L-complexity tasks is preserved, but the effect magnitudes are much smaller. KEY NUMBER CHANGES - HumanEval CoT delta (Qwen-32B): was +68. 9 pp (v2) -> +23. 2 pp (v3). Baseline no-CoT accuracy corrected from 15. 9% to 62. 2%. - HumanEval CoT delta (Qwen-7B): was -27. 4 pp (v2) -> -28. 7 pp (v3). - HumanEval CoT delta (Llama-8B): was +15. 9 pp (v2) -> +9. 1 pp (v3). - Pre-registered McNemar tests significant after Bonferroni: was 10/15 (v2) -> 9/15 (v3). The Llama-8B HumanEval cell is no longer significant. - GSM8K, MATH, MMLU, ARC-Challenge deltas: unchanged from v2. WHAT v3 ARGUES The core thesis remains the same as v2: The math-side prediction of the Hdp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH. The negative TC⁰ prediction (CoT actively hurts low-depth tasks) is not supported: CoT is approximately neutral on MMLU and ARC. HumanEval continues to show the predicted model-size-dependent transition (+23. 2 pp for Qwen-32B, +9. 1 pp for Llama-8B, -28. 7 pp for Qwen-7B), confirming that CoT hurts smaller models but helps larger models on intermediate-complexity tasks. PROVENANCE A parser artefact caused the models to receive abnormally low no-CoT scores on HumanEval because special tokens (e. g. ``) were not stripped prior to functional execution, resulting in tracebacks. The `scoreₕumaneval. py` script was updated to strip these tags using regex. This correction has been integrated into the provided replication datasets and the SQLite database.
Tughanbulut Kurtulush (Sat,) studied this question.