As large language models (LLMs) are increasingly deployed in diverse user-facing applications, aligning them with real user preferences becomes essential. Existing methods like Reinforcement Learning from Human Feedback (RLHF) rely on expert annotators trained on manually defined guidelines, whose judgments may not reflect the priorities of everyday users. We introduce Reinforcement Learning from User Feedback (RLUF), a framework for aligning LLMs directly to implicit signals from users in production. RLUF addresses key challenges of user feedback, which is often binary (e.g., emoji reactions), sparse, and occasionally adversarial. We train a reward model, PLove, to predict the likelihood that an LLM response will receive a Love Reaction, a lightweight form of positive user feedback, and integrate PLove into a multi-objective policy optimization framework alongside helpfulness and safety objectives. In large-scale experiments, we show that PLove is predictive of increased positive feedback and serves as a reliable offline evaluator of future user behavior. Policy optimization using PLove significantly raises observed positive-feedback rates, including a 28% increase in Love Reactions during live A/B tests. However, optimizing for positive reactions introduces reward hacking challenges, requiring careful balancing of objectives. By directly leveraging implicit signals from users, RLUF offers a path to aligning LLMs with real-world user preferences at scale.
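The two ideas in the abstract, a reward model that predicts the probability of a Love Reaction from binary feedback and a policy reward that balances that prediction against helpfulness and safety, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain logistic model over hypothetical numeric features and a fixed weighted sum of objectives, since the abstract does not state the model architecture or how the objectives are combined. All function names, features, and mixing weights below are made up for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(p, label):
    """Loss for one binary label (1 = Love Reaction, 0 = none)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def predict_love_probability(weights, bias, features):
    """Hypothetical stand-in for a PLove-style score: P(Love | response)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(data, n_features, epochs=200, lr=0.5):
    """Fit the logistic model with per-example SGD on binary feedback."""
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for features, label in data:
            p = predict_love_probability(weights, bias, features)
            grad = p - label  # gradient of the BCE loss w.r.t. the logit
            weights = [w - lr * grad * x for w, x in zip(weights, features)]
            bias -= lr * grad
    return weights, bias


def combined_reward(p_love, helpfulness, safety, mix=(0.2, 0.5, 0.3)):
    """Assumed scalarization: a fixed weighted sum of the three objectives.

    The mixing weights are illustrative only. Keeping the Love weight
    modest is one simple way to limit reward hacking on that signal.
    """
    a, b, c = mix
    return a * p_love + b * helpfulness + c * safety


if __name__ == "__main__":
    # Toy feedback log: (feature vector, received a Love Reaction?)
    toy = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
    w, b = train(toy, n_features=2)
    p = predict_love_probability(w, b, [1.0, 0.0])
    print(combined_reward(p, helpfulness=0.8, safety=0.9))
```

In this sketch the Love score enters the policy reward as only one term, so a response cannot score well on reactions alone; how the actual system balances its objectives is not specified in the abstract.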
Eun‐Jung Han
Jun Chen
Karthik Abinav Sankararaman
Han et al. studied this question.
www.synapsesocial.com/papers/68f5a78aab63786de5b4614a — DOI: https://doi.org/10.48550/arxiv.2505.14946