What question did this study set out to answer?

This research investigates how large language models can improve their accuracy by generating and using computer programs as tools for reasoning over structured data.

June 20, 2026 · Open Access

Externalising reasoning by teaching language models to use programs as tools

Key Points

Established that transformer-based LLMs can translate natural language into short programs.
Demonstrated that retrieving relevant code snippets boosts translation accuracy.
Developed a novel framework in which LLMs generate task-specific executable programs, improving their performance on large, structured datasets.
LLMs answered structured queries more accurately when using generated code tools, because the tools removed irrelevant information before the model reasoned over the data.
Generating code iteratively from user utterances, with each turn's execution output fed back to the code-writing LLM, supported fluid conversation over long, structured tasks in an assistant application.
Synthetic aligned pairs of natural language and code, generated by the LLMs themselves, improved training and NL-to-code translation accuracy.
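
The program-as-tool idea in the points above can be illustrated with a minimal sketch. It is not the author's implementation: `generate_filter_program` is a hypothetical stand-in for the LLM call and returns a hard-coded program, and the table, column names, and question are invented for illustration. The sketch only shows the shape of the pipeline: generate a small row filter, execute it, and pass the reduced table on to the answering model.

```python
import csv
import io

# Toy table, invented for illustration only.
TABLE_CSV = """city,country,population
Paris,France,2100000
Lyon,France,520000
Berlin,Germany,3600000
Munich,Germany,1500000
"""


def load_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def generate_filter_program(question, columns):
    """Hypothetical stand-in for the code-writing LLM.

    A real system would prompt a model with the question and the column
    names; this stub ignores both and returns a fixed program.
    """
    return (
        "def keep(row):\n"
        "    return row['country'] == 'Germany'\n"
    )


def run_filter(program_source, rows):
    """Execute the generated source and apply its keep(row) predicate.

    Assumption: exec is used for brevity. Model-written code should be
    run in a sandbox in any real deployment.
    """
    namespace = {}
    exec(program_source, namespace)
    keep = namespace["keep"]
    return [row for row in rows if keep(row)]


def build_answer_prompt(question, rows):
    """Serialise the filtered rows into a prompt for the answering LLM."""
    header = ",".join(rows[0].keys()) if rows else ""
    body = "\n".join(",".join(row.values()) for row in rows)
    return f"Table:\n{header}\n{body}\n\nQuestion: {question}"


rows = load_rows(TABLE_CSV)
question = "Which German city has the largest population?"
program = generate_filter_program(question, list(rows[0].keys()))
filtered = run_filter(program, rows)
prompt = build_answer_prompt(question, filtered)
print(prompt)
```

The answering model then sees only the rows that survive the filter, which is the sense in which the generated program acts as a tool for discarding irrelevant information.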

Abstract

Large Language Models (LLMs) have revolutionised Natural Language Processing (NLP) by scalably identifying patterns in text to predict subsequent tokens with minimal human supervision. However, unlike humans, who readily employ external tools such as calculators and spreadsheets to manage and reason over large-scale structured information, LLMs are typically self-contained, relying solely on their learned weights to reason over the input context. This limitation hinders their performance on tasks that require precise, scalable reasoning over structured data, where intuitive pattern matching often fails. This thesis argues that LLMs can overcome this limitation by learning to write and utilise computer programs as tools, mirroring human strategies of externalising reasoning to enhance accuracy on tasks involving structured information.

To substantiate this claim, this thesis first establishes that transformer-based LLMs can translate natural language (NL) into short programs. It further demonstrates that retrieval mechanisms can boost the accuracy of this translation by incorporating relevant code snippets. Recognising the challenge of acquiring extensive training data for robust program generation, this work investigates the use of pre-trained LLMs as a source of weak supervision. This research demonstrates that aligning pre-training and fine-tuning objectives through a process termed "reiteration" significantly benefits NL-to-code translation. Furthermore, accuracy is enhanced by leveraging the unsupervised generative capabilities of LLMs to create synthetic aligned data pairs of natural language and code.

Subsequently, this thesis demonstrates the concept of tool use by leveraging the code-generation capabilities of language models in the context of semi-structured data. It empirically establishes that LLMs struggle with question answering over increasingly large tables. To address this deficiency, this thesis proposes a novel framework wherein LLMs generate task-specific, executable programs to filter rows. The output from these programs is then fed back into the LLM, exemplifying the use of the tool. I show that the application of these program tools significantly improves the accuracy of LLMs when querying large, structured datasets by enabling them to scalably remove irrelevant information.

Finally, the thesis presents a practical, real-world application of tool use, showcasing the effectiveness of language-conditioned code writing in a conversational assistant setting. In this setting, the assistant interprets a user's utterance, generates corresponding code, executes it to update its internal state, and generates a response. At each turn in the conversation, the system reintroduces the output from the previously executed code to the code-writing LLM. This continuous tool use allows for fluid conversation over long, structured tasks, thereby validating the broader applicability of the tool-augmented LLM paradigm.
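
The conversational loop described at the end of the abstract (interpret the utterance, write code, execute it against the assistant's state, respond, and carry the execution output into the next turn) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the system from the thesis: `write_code` is a hypothetical rule-based stand-in for the code-writing LLM, and the shopping-list task, the `state` dictionary, and the `result` variable convention are all invented for the example.

```python
def write_code(utterance, previous_output):
    """Hypothetical stand-in for the code-writing LLM.

    A real model would condition on both the utterance and the output of
    the previously executed code; this stub only inspects the utterance.
    """
    words = utterance.lower().split()
    if words[:1] == ["add"] and len(words) > 1:
        item = " ".join(words[1:])
        return (
            f"state.setdefault('items', []).append({item!r})\n"
            "result = list(state['items'])"
        )
    return "result = list(state.get('items', []))"


def execute(code, state):
    """Run generated code against the assistant's state and return `result`.

    Assumption: exec is used for brevity; generated code should be
    sandboxed in practice.
    """
    scope = {"state": state}
    exec(code, scope)
    return scope.get("result")


def respond(result):
    """Turn the execution output into a reply for the user."""
    if not result:
        return "Your list is empty."
    return "Your list has: " + ", ".join(result)


def converse(utterances):
    """Run one conversation, feeding each turn's output into the next."""
    state = {}
    previous_output = None
    replies = []
    for utterance in utterances:
        code = write_code(utterance, previous_output)
        previous_output = execute(code, state)
        replies.append(respond(previous_output))
    return replies


replies = converse(["add milk", "add eggs", "what is on my list"])
for reply in replies:
    print(reply)
```

The state lives outside the model and is changed only by executed code, so the conversation history does not need to be re-derived from text at each turn.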
