Title: Accelerating RLHF Training with Reward Variance Increase
Speaker: Yancheng Yuan (Department of Applied Mathematics, The Hong Kong Polytechnic University)
Host: Liang Chen
Time: June 4, 2025, 15:00–17:00
Venue: Lecture Hall 425
Abstract: Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model that accelerates RLHF training by provably increasing the reward variance while preserving the relative preferences and the reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel $O(n \log n)$ algorithm that finds a global solution of the nonconvex reward adjustment model by explicitly characterizing the extreme points of the feasible set. As an important application, we naturally integrate this reward adjustment model into the GRPO algorithm, leading to a more efficient GRPO with reward variance increase (GRPOVI) algorithm for RLHF training. As an interesting byproduct, we provide an indirect explanation for the empirical effectiveness of GRPO with rule-based reward for RLHF training, as demonstrated in DeepSeek-R1. Experimental results demonstrate that the GRPOVI algorithm significantly improves RLHF training efficiency compared to the original GRPO algorithm.
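To make the three invariants in the abstract concrete, here is a minimal, purely illustrative Python sketch. It is not the speaker's algorithm (whose adjustment model is a constrained nonconvex problem solved via an extreme-point characterization); it only demonstrates the simplest transform with the stated properties: a mean-preserving spread that keeps the reward expectation and the reward ranking while multiplying the variance by $c^2$. The function names `adjust_rewards` and `grpo_advantages` are hypothetical, introduced here for illustration.

```python
import numpy as np

def adjust_rewards(rewards, c=2.0):
    """Toy mean-preserving spread: scale deviations from the group mean by c > 1.

    For c > 1 this transform
      * preserves the reward expectation (the mean is a fixed point),
      * preserves relative preferences (it is strictly increasing),
      * multiplies the reward variance by c**2.
    It is only an illustration, not the talk's nonconvex adjustment model.
    """
    r = np.asarray(rewards, dtype=float)
    mu = r.mean()
    return mu + c * (r - mu)

def grpo_advantages(rewards):
    """Group-relative advantages in the GRPO style: standardize within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

if __name__ == "__main__":
    group_rewards = [0.1, 0.4, 0.5, 0.9]  # rewards of sampled responses to one prompt
    adjusted = adjust_rewards(group_rewards, c=2.0)
    print("mean before/after:", np.mean(group_rewards), adjusted.mean())  # unchanged
    print("var  before/after:", np.var(group_rewards), adjusted.var())    # 4x larger
    print("ranking preserved:", np.argsort(group_rewards).tolist() == np.argsort(adjusted).tolist())
```

One observation this toy makes visible: because GRPO standardizes rewards within each group, an affine spread like the one above leaves the group-relative advantages unchanged, which suggests why a nontrivial, generally non-affine adjustment (hence the nonconvex model in the talk) is needed to actually reshape the training signal.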
About the speaker: Yancheng Yuan is an Assistant Professor in the Department of Applied Mathematics, The Hong Kong Polytechnic University. His research focuses on continuous optimization, the mathematical foundations of data science, and data-driven applications. His work has been published in prestigious academic journals and conferences, including SIAM Journal on Optimization, Mathematical Programming Computation, Journal of Machine Learning Research, IEEE Transactions on Pattern Analysis and Machine Intelligence, NeurIPS, ICML, ICLR, ACM WWW, and ACM SIGIR. His papers were selected as Best Paper Award finalists at ACM WWW 2021 and ACM SIGIR 2024.