Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting the potential of code quality as a heuristic for improving performance at test time.
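The separation of reasoning from execution described above can be illustrated with a minimal sketch. It is not the authors' framework: the function names `run_program` and `select_answer`, the convention that a generated program stores its result in a variable called `answer`, and the hard-coded candidate programs are all hypothetical. Executability is used here as a crude stand-in for code quality, which is an assumption rather than the paper's metric.

```python
from collections import Counter


def run_program(source: str):
    """Execute a generated program and return its `answer` variable.

    Returns None if the program fails to compile or run. Emptying
    `__builtins__` only limits what the snippet can call; it is not a
    real sandbox.
    """
    scope = {}
    try:
        exec(source, {"__builtins__": {}}, scope)
    except Exception:
        return None
    return scope.get("answer")


def select_answer(candidates):
    """Hypothetical test-time heuristic: drop candidates that do not
    execute, then take a majority vote over the remaining answers."""
    results = [run_program(src) for src in candidates]
    valid = [r for r in results if r is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Stand-ins for programs a model might generate for one question
# (the question could be in any language; only the program is executed).
candidates = [
    "apples = 3 * 4\nanswer = apples - 5",
    "answer = 3 * 4 - 5",
    "answer = 3 * (4 - ",  # malformed program, filtered out
]

print(select_answer(candidates))
```

In this sketch the model is responsible only for producing the program, and the interpreter produces the final number, so a flawed program shows up as a failed or outvoted candidate rather than as a silent arithmetic slip.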
Patomporn Payoungkhamdee
Pume Tuchinda
Jinheon Baek
Payoungkhamdee et al. studied this question.
www.synapsesocial.com/papers/68f0d5eb105731330a2b2072 — DOI: https://doi.org/10.48550/arxiv.2502.17956