Optimizing both large language models (LLMs) and small language models (SLMs) for real-world use requires thoughtful post-training adaptation. This overview highlights three key strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning. Supervised Fine-Tuning refines pre-trained models on labeled, instruction-following datasets, improving task accuracy and response controllability by aligning outputs with ground-truth examples. DPO simplifies preference-based training by folding human preference feedback directly into the training objective, eliminating the need for a separate reward model or unstable policy-gradient updates. It offers a more stable, efficient alternative to traditional Reinforcement Learning from Human Feedback (RLHF). Online Reinforcement Learning introduces continuous updates based on real-time user interactions and dynamically generated data, which enhances adaptability and enables models to respond better to changing user needs and domain shifts. Emerging methods such as online DPO and Group Relative Policy Optimization (GRPO) outperform other approaches on both precise tasks (e.g., mathematical reasoning) and open-ended tasks (e.g., instruction following). Together, these three methods enable more effective, efficient, and controllable adaptation of LLMs and SLMs. By aligning model behavior with user intent while addressing bias and training inefficiencies, they significantly improve language model utility in real-world applications.
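To make the preference-based objectives named above concrete, here is a minimal sketch in plain Python. It is not taken from the paper: it assumes the commonly published form of the DPO loss for a single (chosen, rejected) response pair, and a group-relative advantage of the kind used in GRPO, normalized with the population standard deviation. The function names, the `beta` default, and the `eps` term are illustrative assumptions.

```python
import math
import statistics


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (assumed standard form).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model. No separate
    reward model is involved: the log-ratio acts as an implicit reward.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when all rewards in the group are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the DPO loss equals log 2; the loss falls as the policy raises the chosen response relative to the rejected one. The group-relative advantages sum to approximately zero, so responses are scored only against their siblings for the same prompt.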
Cassel Scott-Emuakpor
Cassel Scott-Emuakpor (Mon,) studied this question.
www.synapsesocial.com/papers/68c1ac0154b1d3bfb60e47ca — DOI: https://doi.org/10.36227/techrxiv.175427207.75884899/v1