BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Memento EPFL//
BEGIN:VEVENT
SUMMARY:Reward Model Learning vs. Direct Preference Optimization: A Compar
 ative Analysis of Learning from Human Preferences
DTSTART:20260213T111500
DTEND:20260213T120000
DTSTAMP:20260408T044331Z
UID:dd229c8b466476454a6d9587d89d247b39ef8dcb971283ab57b9981e
CATEGORIES:Conferences - Seminars
DESCRIPTION:Andi Nika\, doctoral researcher at the Max Planck Institute fo
 r Software Systems\, Kaiserslautern\, Germany\nAbstract: Reinforcement lea
 rning from human feedback (RLHF) and direct preference optimization (DPO) 
 are two leading paradigms for fine-tuning large language models from huma
 n preference data. In this talk\, we study these methods from a statistic
 al and robustness-theoretic perspective. Focusing on log-linear policy par
 ameterizations and linear reward functions\, we derive order-optimal boun
 ds on the suboptimality gap for both RLHF and DPO under a range of oracle 
 models and realizability assumptions. These results enable a comparison o
 f the two approaches\, delineating the regimes in which one method provab
 ly outperforms the other. We conclude the talk with a discussion of the r
 elative susceptibility of RLHF and DPO to data poisoning attacks.\n\nBio
 : Andi Nika is a doctoral researcher at the Max Planck Institute for Soft
 ware Systems. He completed graduate studies in Electrical and Electronics 
 Engineering and Mathematics at Bilkent University\, Ankara\, and received 
 his undergraduate degree in Mathematics from the University of Tirana. His
  research focuses on reinforcement learning theory\, with particular empha
 sis on multi-agent systems and post-training methods for generative models
 .
LOCATION:ME C2 405 https://plan.epfl.ch/?room==ME%20C2%20405
STATUS:CONFIRMED
END:VEVENT
END:VCALENDAR
