What question did this study set out to answer?

This research examines how chain-of-thought (CoT) prompting affects the performance of large language models (LLMs) on five benchmarks: GSM8K, MATH, MMLU, ARC-Challenge, and the coding benchmark HumanEval.

June 1, 2026 · Open Access

When Chain-of-Thought Helps and When It Hurts: A Communication-Complexity Account of LLM Benchmark Behaviour via the Hdp Bandwidth Bound

Key Points

HumanEval scoring was updated to strip stop tokens before functional execution, so correctly generated code is no longer scored as a failure.
Comparison across models: Qwen-32B, Qwen-7B, and Llama-8B.
Pre-registered McNemar tests with Bonferroni correction: 9 of 15 are significant in v3, down from 10 of 15 in v2.
On HumanEval, CoT improves Qwen-32B by +23.2 pp after the scoring correction, down from the +68.9 pp reported in v2.
The HumanEval no-CoT baseline for Qwen-32B was corrected from 15.9% to 62.2%.
CoT hurts Qwen-7B on HumanEval, changing accuracy by -28.7 pp.
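
The significance check named in the key points can be sketched as follows. This is a minimal illustration, not the study's analysis code: it assumes an exact two-sided McNemar test on paired per-problem outcomes and a family-wise alpha of 0.05 split across the 15 pre-registered tests. The function names and the alpha value are assumptions; the source states only that McNemar tests with Bonferroni correction were used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: problems solved without CoT but failed with CoT
    c: problems failed without CoT but solved with CoT
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]
```

With 15 tests, the per-test threshold under these assumptions is 0.05 / 15, roughly 0.0033, which is why a cell with a modest delta can lose significance after a scoring change.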

Abstract

v3: Substantive correction of the v2 (May 2026) release.

CHANGE SUMMARY

Inference outputs are unchanged; only the scoring of HumanEval was affected. A parsing oversight in evaluating raw Python outputs failed to strip stop tokens (e.g., ``), causing functional execution failures (`SyntaxError`) for correctly generated code. This artificially suppressed the no-CoT scores, especially for Qwen-32B (where it dropped to 15.9%). With proper token stripping (released as an updated results/score_humaneval.py), the dramatic +68.9 pp CoT boost for Qwen-32B is reduced to +23.2 pp. The core finding of a model-size-dependent transition on L-complexity tasks is preserved, but the effect magnitudes are much smaller.

KEY NUMBER CHANGES

- HumanEval CoT delta (Qwen-32B): was +68.9 pp (v2) -> +23.2 pp (v3). Baseline no-CoT accuracy corrected from 15.9% to 62.2%.
- HumanEval CoT delta (Qwen-7B): was -27.4 pp (v2) -> -28.7 pp (v3).
- HumanEval CoT delta (Llama-8B): was +15.9 pp (v2) -> +9.1 pp (v3).
- Pre-registered McNemar tests significant after Bonferroni: was 10/15 (v2) -> 9/15 (v3). The Llama-8B HumanEval cell is no longer significant.
- GSM8K, MATH, MMLU, ARC-Challenge deltas: unchanged from v2.

WHAT v3 ARGUES

The core thesis remains the same as v2:

- The math-side prediction of the Hdp framework (CoT recovers single-pass bandwidth) is strongly supported across all three models on GSM8K and MATH.
- The negative TC⁰ prediction (CoT actively hurts low-depth tasks) is not supported: CoT is approximately neutral on MMLU and ARC.
- HumanEval continues to show the predicted model-size-dependent transition (+23.2 pp for Qwen-32B, +9.1 pp for Llama-8B, -28.7 pp for Qwen-7B), confirming that CoT hurts smaller models but helps larger models on intermediate-complexity tasks.

PROVENANCE

A parser artefact caused the models to receive abnormally low no-CoT scores on HumanEval because special tokens (e.g. ``) were not stripped prior to functional execution, resulting in tracebacks. The `score_humaneval.py` script was updated to strip these tags using regex. This correction has been integrated into the provided replication datasets and the SQLite database.
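
The scoring failure and its fix can be illustrated with a short sketch. It is not the released scoring script: the record does not name the tokens involved, so the `<|...|>` pattern, the token `<|hypothetical_stop|>`, and the helper names `strip_special_tokens` and `compiles` are all assumptions chosen for illustration.

```python
import re

# Assumption: special tokens have the form <|name|>. The actual tokens stripped
# by the released script are not stated in the source.
SPECIAL_TOKEN = re.compile(r"<\|[^|>\s]+\|>")


def strip_special_tokens(completion: str) -> str:
    """Remove special-token markers and normalise trailing whitespace."""
    return SPECIAL_TOKEN.sub("", completion).rstrip() + "\n"


def compiles(source: str) -> bool:
    """Return True if the source parses as Python, False on SyntaxError."""
    try:
        compile(source, "<completion>", "exec")
        return True
    except SyntaxError:
        return False


# A correct solution followed by a leftover (hypothetical) stop token.
raw = "def add(a, b):\n    return a + b\n<|hypothetical_stop|>"

print(compiles(raw))                        # the trailing token breaks parsing
print(compiles(strip_special_tokens(raw)))  # the cleaned completion parses
```

A pattern this broad would also rewrite string literals that happen to contain `<|...|>`, so a production scorer would more plausibly strip an explicit list of the tokenizer's known special tokens.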
