Reward Model Learning vs Direct Preference Optimization: A Comparative Analysis of Learning from Human Preferences
Event details
| Date | 13.02.2026 |
| Hour | 11:15–12:00 |
| Speaker | Andi Nika, doctoral researcher at the Max Planck Institute for Software Systems, Kaiserslautern, Germany |
| Location | |
| Category | Conferences - Seminars |
| Event Language | English |
Abstract: Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) are two leading paradigms for fine-tuning large language models from human preference data. In this talk, we study these methods from a statistical and robustness-theoretic perspective. Focusing on log-linear policy parameterizations and linear reward functions, we derive order-optimal bounds on the suboptimality gap for both RLHF and DPO under a range of oracle models and realizability assumptions. These results enable a comparison of the two approaches, delineating the regimes in which one method provably outperforms the other. We conclude the talk with a discussion of the relative susceptibility of RLHF and DPO to data poisoning attacks.
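For context on the two objectives being compared, here is a minimal sketch in standard notation (following the original RLHF and DPO formulations, not necessarily the exact setup analyzed in the talk): RLHF first fits a reward model on preference pairs (x, y_w, y_l) via the Bradley–Terry maximum-likelihood loss and then maximizes a KL-regularized expected reward, whereas DPO minimizes a single logistic loss on the same preference data, with the implicit reward expressed through the policy itself.

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\begin{document}
% Standard formulations, included as background only; they mirror the usual
% RLHF and DPO objectives, not the specific bounds presented in the talk.
\begin{align*}
% RLHF, step 1: Bradley-Terry reward-model fit on preference pairs (x, y_w, y_l)
\hat{r} &\in \arg\max_{r} \sum_{(x,\, y_w,\, y_l)} \log \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr), \\
% RLHF, step 2: KL-regularized policy optimization against the learned reward
\hat{\pi} &\in \arg\max_{\pi} \;
  \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\bigl[\hat{r}(x, y)\bigr]
  - \beta\, \mathbb{E}_{x}\!\left[\mathrm{KL}\!\left(\pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)\right], \\
% DPO: direct policy optimization on the same preference data
\mathcal{L}_{\mathrm{DPO}}(\pi) &= - \sum_{(x,\, y_w,\, y_l)}
  \log \sigma\!\left(\beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                   - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right).
\end{align*}
\end{document}
```

Here σ is the logistic function, π_ref a fixed reference policy, and β > 0 the KL-regularization strength; as stated in the abstract, the talk's analysis restricts the reward to linear and the policy to log-linear parameterizations.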
Bio: Andi Nika is a doctoral researcher at the Max Planck Institute for Software Systems. He completed graduate studies in Electrical and Electronics Engineering and Mathematics at Bilkent University, Ankara, and received his undergraduate degree in Mathematics from the University of Tirana. His research focuses on reinforcement learning theory, with particular emphasis on multi-agent systems and post-training methods for generative models.
Practical information
- General public
- Free
Organizer
- Prof. Maryam Kamgarpour