What question did this study set out to answer?

This research aims to improve reasoning capabilities of Large Language Models through enhanced training methods.

February 13, 2026 · Open Access

Use large language model to enhance reasoning of another large language model through reward updated GRPO

Key Points

This research aims to improve reasoning capabilities of Large Language Models through enhanced training methods.
Introduced a novel package for generating reasoning data from various data sources.
Proposed an updated objective function for Group Relative Policy Optimization with a custom reward mechanism.
Created a publicly available dataset and trained model for broader accessibility and research.
Best-performing model achieved 98.2% mean token accuracy on GSM8K and 98.5% on Warren Buffett Letters datasets.
Training the model took 40–42 hours of compute and cost $78–$82.
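As a rough illustration of the headline metric, mean token accuracy can be read as the share of token positions where the model's predicted token matches the reference token, averaged over sequences. The sketch below assumes that reading; the function name `mean_token_accuracy`, the per-sequence averaging, and the `ignore_id` masking value are illustrative assumptions, not details confirmed by the paper.

```python
def mean_token_accuracy(predictions, references, ignore_id=-100):
    """Average, over sequences, of the fraction of scored positions
    where the predicted token id equals the reference token id.

    Positions whose reference id equals `ignore_id` (an assumed masking
    convention, e.g. for prompt or padding tokens) are not scored.
    """
    per_sequence = []
    for pred, ref in zip(predictions, references):
        # Keep only positions that should be scored.
        pairs = [(p, r) for p, r in zip(pred, ref) if r != ignore_id]
        if not pairs:
            continue  # nothing to score in this sequence
        correct = sum(1 for p, r in pairs if p == r)
        per_sequence.append(correct / len(pairs))
    return sum(per_sequence) / len(per_sequence) if per_sequence else 0.0


if __name__ == "__main__":
    # Toy token ids: one mismatch in the first sequence, none in the second.
    preds = [[1, 2, 3, 4], [5, 6]]
    refs = [[1, 2, 0, 4], [5, 6]]
    print(mean_token_accuracy(preds, refs))
```

Under this definition a score near 98% means almost every scored token matches the reference, which is a different quantity from exact-match accuracy on final answers.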

Abstract

Recent advancements in deep learning have significantly transformed natural language processing (NLP), enabling sophisticated reasoning and text generation. However, fine-tuning Large Language Models (LLMs) for domain-specific tasks remains a challenge due to the need for curated datasets. This paper introduces a novel package that allows developers to generate reasoning data from any data source, enhancing LLM adaptability across various domains. Additionally, we propose an updated objective function for Group Relative Policy Optimization (GRPO) with a novel reward component to improve training efficiency and model performance. Instead of competing with a few universal benchmarks, we propose a framework to set custom reward functions and design experimental processes to converge on the custom reward function. To support further research, we publicly release our dataset and trained model, facilitating broader adoption and evaluation. Our contributions include (1) a publicly available package for reasoning data generation using LLMs, (2) an enhanced GRPO objective function with a novel reward mechanism, and (3) open access to the dataset and model to promote continued advancements in LLM training. By addressing key challenges in fine-tuning and optimization, our work provides valuable resources for the NLP community and contributes to improving reasoning capabilities in LLMs. Finally, we present results comparing model performance on GSM8K and Warren Buffett Letters datasets. The best-performing model, Qwen 2.5-3B-Instruct, achieved 98.2% and 98.5% mean token accuracy on the respective datasets, with a compute time of 40–42 hours and a cost of $78–$82.
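To make the idea of a custom reward feeding GRPO concrete, here is a minimal sketch of the group-relative step in standard GRPO: several completions are sampled for one prompt, each is scored by a reward function, and each reward is normalized against the group's mean and standard deviation. This is not the paper's updated objective or its reward component. The `custom_reward` function, its weights, and the `<think>` tag check are hypothetical, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def custom_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical reward: 1.0 if the completion ends with the reference
    answer, plus 0.2 if it wraps its reasoning in <think> tags."""
    reward = 0.0
    if completion.strip().endswith(reference_answer):
        reward += 1.0
    if "<think>" in completion and "</think>" in completion:
        reward += 0.2
    return reward


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: (reward - group mean) / group std.

    `eps` avoids division by zero when every completion in the group
    receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "7"
    group = [
        "<think>3 + 4 = 7</think> The answer is 7",
        "The answer is 8",
        "The answer is 7",
        "<think>3 + 4 = 8</think> The answer is 8",
    ]
    rewards = [custom_reward(c, reference) for c in group]
    for completion, adv in zip(group, group_relative_advantages(rewards)):
        print(f"{adv:+.3f}  {completion}")
```

Completions that score above the group average get positive advantages and are reinforced, while those below get negative ones. Swapping in a different `custom_reward` changes what the policy is pushed toward without touching the normalization step.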

Cite This Study

Yiqiao Yin studied this question.

synapsesocial.com/papers/698e70d96645d80bf91f9a42 https://doi.org/10.1038/s41598-026-39296-8