Large Language Models (LLMs) have revolutionised Natural Language Processing (NLP) by identifying patterns in text at scale to predict subsequent tokens with minimal human supervision. However, unlike humans, who readily employ external tools such as calculators and spreadsheets to manage and reason over large-scale structured information, LLMs are typically self-contained, relying solely on their learned weights to reason over the input context. This limitation hinders their performance on tasks that require precise, scalable reasoning over structured data, where intuitive pattern matching often fails. This thesis argues that LLMs can overcome this limitation by learning to write and utilise computer programs as tools, mirroring human strategies of externalising reasoning to enhance accuracy on tasks involving structured information. To substantiate this claim, this thesis first establishes that transformer-based LLMs can translate natural language (NL) into short programs. It further demonstrates that retrieval mechanisms can improve translation accuracy by incorporating relevant code snippets. Recognising the challenge of acquiring extensive training data for robust program generation, this work investigates the use of pre-trained LLMs as a source of weak supervision. It demonstrates that aligning pre-training and fine-tuning objectives through a process termed "reiteration" significantly benefits NL-to-code translation. Furthermore, accuracy is enhanced by leveraging the unsupervised generative capabilities of LLMs to create synthetic aligned data pairs of natural language and code. Subsequently, this thesis demonstrates the concept of tool use by leveraging the code-generation capabilities of language models in the context of semi-structured data. It empirically establishes that LLMs struggle with question answering over increasingly large tables.
To address this deficiency, this thesis proposes a novel framework wherein LLMs generate task-specific, executable programs to filter rows. The output from these programs is then fed back into the LLM, so that each program serves as a tool. The thesis shows that applying these program tools significantly improves the accuracy of LLMs when querying large, structured datasets by enabling them to remove irrelevant information at scale. Finally, the thesis presents a practical, real-world application of tool use, showcasing the effectiveness of language-conditioned code writing in a conversational assistant setting. In this setting, the assistant interprets a user’s utterance, generates corresponding code, executes it to update its internal state, and generates a response. At each turn in the conversation, the system reintroduces the output from the previously executed code to the code-writing LLM. This continuous tool use allows for fluid conversation over long, structured tasks, thereby validating the broader applicability of the tool-augmented LLM paradigm.
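The row-filtering loop described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: `fake_program_writer`, `run_filter`, `build_prompt`, the `keep` function convention, and the example table are all hypothetical names and data introduced here for illustration, and the code-writing model is replaced by a stub that returns a fixed program.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def fake_program_writer(question: str) -> str:
    """Hypothetical stand-in for an LLM that writes a row filter for a question."""
    # Assumption: a real system would prompt a code-writing model here.
    # This stub ignores the question and returns a fixed program.
    return "def keep(row):\n    return row['country'] == 'France'\n"


def run_filter(program: str, table: List[Row]) -> List[Row]:
    """Execute the generated program and apply its `keep` function to each row."""
    namespace: dict = {}
    # Illustrative only: generated code should run in a sandbox in practice.
    exec(program, namespace)
    keep: Callable[[Row], bool] = namespace["keep"]
    return [row for row in table if keep(row)]


def build_prompt(question: str, rows: List[Row]) -> str:
    """Format the reduced table and the question as input for the answering model."""
    lines = [", ".join(f"{k}={v}" for k, v in row.items()) for row in rows]
    return "\n".join(lines) + f"\nQuestion: {question}"


# Hypothetical example table and question.
table = [
    {"city": "Paris", "country": "France", "population": 2100000},
    {"city": "Lyon", "country": "France", "population": 520000},
    {"city": "Berlin", "country": "Germany", "population": 3600000},
]
question = "Which French city in the table has the largest population?"

# Step 1: the (stubbed) model writes a program. Step 2: the program filters rows.
filtered = run_filter(fake_program_writer(question), table)
# Step 3: the program's output is fed back to the model as a shorter context.
prompt = build_prompt(question, filtered)
```

In this sketch the answering model would see only the rows that survive the filter, which is the sense in which the program removes irrelevant information before the model reasons over the table.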
Carlos Gemmell Górriz studied this question.