What question did this study set out to answer?

This research aims to develop and analyze low-rank reinforcement learning methods to improve decision-making in high-dimensional environments with heterogeneous human feedback.

May 31, 2026 · Open Access

Low-Rank Reinforcement Learning With Heterogeneous Human Feedback: From Recommendation to Large Language Models

Key Points

Investigated the dynamic assortment problem in high-dimensional e-commerce settings with low-rank user–item interactions.
Proposed a low-rank contextual reinforcement learning from human feedback (RLHF) framework for large language models.
Provided theoretical analyses and extensive numerical experiments, including an evaluation on the Expedia Hotel recommendation dataset.
Significantly reduced the complexity of estimating personalized utilities by imposing a low-rank structure.
Derived provable regret bounds that characterize the efficiency gain over traditional methods.
Offered theoretical guarantees on sample efficiency and robust performance under distribution shifts.

Abstract

Modern decision-making systems, from online marketplaces to large language models (LLMs), increasingly operate in high-dimensional environments and rely on human feedback. However, the inherent heterogeneity of user preferences and the massive scale of feature spaces pose significant challenges for statistical efficiency and robust alignment. This dissertation develops and analyzes low-rank reinforcement learning (RL) methods that exploit latent structure to achieve scalability with theoretical guarantees. In the first part, we investigate the dynamic assortment problem in high-dimensional e-commerce settings. By imposing a low-rank structure on user–item interactions, we significantly reduce the complexity of estimating personalized utilities. We demonstrate how this structure enables efficient exploration-exploitation strategies and provide provable regret bounds that characterize the gain in efficiency over traditional methods. We then assess the performance of our method on the Expedia Hotel recommendation dataset. The second part of this dissertation extends these principles to Reinforcement Learning from Human Feedback (RLHF) within large-scale contextual environments. We propose a low-rank contextual RLHF framework that simultaneously addresses diverse user preferences and the intricate latent spaces typical of modern LLMs. Our approach incorporates personalized reward modeling for alignment, offering theoretical guarantees on sample efficiency and robust performance under distribution shifts. Throughout this work, we provide rigorous theoretical analyses, algorithmic descriptions, and extensive numerical experiments. Together, these contributions illustrate how a low-rank perspective unifies efficiency and robustness in personalized decision-making systems, providing a scalable path for aligning complex models with heterogeneous human values.
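
To make the low-rank idea in the abstract concrete, the following is a minimal illustrative sketch, not the dissertation's algorithm. It assumes a synthetic user–item utility matrix of rank 3 observed with Gaussian noise and uses a plain truncated SVD as a stand-in estimator; the sizes, noise level, and function names are all hypothetical choices made for the illustration. It shows two things: a rank-r factorization needs r(n_users + n_items) parameters instead of n_users × n_items, and projecting noisy observations onto a rank-r subspace can bring the estimate closer to the true utilities.

```python
import numpy as np


def low_rank_parameter_count(n_users: int, n_items: int, rank: int) -> int:
    """Number of free parameters in a rank-`rank` factorization U @ V.T."""
    return rank * (n_users + n_items)


def truncated_svd_estimate(noisy: np.ndarray, rank: int) -> np.ndarray:
    """Best rank-`rank` approximation of `noisy` in Frobenius norm."""
    u, s, vt = np.linalg.svd(noisy, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Hypothetical problem sizes chosen only for illustration.
rng = np.random.default_rng(0)
n_users, n_items, rank = 200, 150, 3

# Synthetic ground truth: utilities generated by latent user and item factors.
user_factors = rng.normal(size=(n_users, rank))
item_factors = rng.normal(size=(n_items, rank))
true_utility = user_factors @ item_factors.T

# Noisy observations of every user-item utility (an assumption; real feedback
# is typically partial and comes from choices, not direct utility readings).
noisy_utility = true_utility + rng.normal(scale=1.0, size=true_utility.shape)

estimate = truncated_svd_estimate(noisy_utility, rank)

full_params = n_users * n_items
low_rank_params = low_rank_parameter_count(n_users, n_items, rank)
raw_error = np.linalg.norm(noisy_utility - true_utility)
low_rank_error = np.linalg.norm(estimate - true_utility)

print(f"parameters: full={full_params}, low-rank={low_rank_params}")
print(f"Frobenius error: raw={raw_error:.1f}, low-rank={low_rank_error:.1f}")
```

In this toy setting the factorized model has far fewer parameters than the full matrix, which is the intuition behind the statistical-efficiency claim. The dissertation's setting additionally involves sequential exploration and choice-based feedback, which this sketch does not model.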
