Recent advances in deep learning have transformed natural language processing (NLP), enabling sophisticated reasoning and text generation. However, fine-tuning Large Language Models (LLMs) for domain-specific tasks remains challenging because it requires curated datasets. This paper introduces a package that lets developers generate reasoning data from any data source, improving LLM adaptability across domains. We also propose an updated objective function for Group Relative Policy Optimization (GRPO) with a novel reward component intended to improve training efficiency and model performance. Rather than competing on a few universal benchmarks, we propose a framework for defining custom reward functions and designing experiments that converge on them. To support further research, we publicly release our dataset and trained model. Our contributions are (1) a publicly available package for reasoning data generation using LLMs, (2) an enhanced GRPO objective function with a novel reward mechanism, and (3) open access to the dataset and model. Finally, we compare model performance on the GSM8K and Warren Buffett Letters datasets. The best-performing model, Qwen2.5-3B-Instruct, achieved 98.2% and 98.5% mean token accuracy on the respective datasets, with a compute time of 40–42 hours and a cost of 78–82.
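To make the idea of a custom reward function in GRPO concrete, the following is a minimal sketch of the standard group-relative advantage step, in which each sampled completion's reward is normalized against the other completions for the same prompt. It does not reproduce the paper's updated objective or its novel reward component, which the abstract does not specify. The function names, the exact-match reward, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Standard GRPO-style normalization: (reward - group mean) / group std.
    Using the population std and a small eps is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def exact_match_reward(completion: str, reference: str) -> float:
    """Hypothetical custom reward: 1.0 if the answer matches exactly, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def score_group(
    completions: List[str],
    reference: str,
    reward_fn: Callable[[str, str], float],
) -> List[float]:
    """Apply a user-supplied reward function, then compute group-relative advantages."""
    rewards = [reward_fn(c, reference) for c in completions]
    return group_relative_advantages(rewards)


if __name__ == "__main__":
    # Four sampled answers to one prompt; two of them match the reference.
    print(score_group(["42", "41", "42 ", "7"], "42", exact_match_reward))
```

Any callable with the same `(completion, reference) -> float` shape could be passed as `reward_fn`, which is one simple way to express a domain-specific reward without changing the normalization step.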
Yiqiao Yin
Scientific Reports
Columbia University
New York University
Yiqiao Yin (Wed,) studied this question.
www.synapsesocial.com/papers/698e70d96645d80bf91f9a42 — DOI: https://doi.org/10.1038/s41598-026-39296-8