AI Center Seminar - AI Fundamentals series - Dr. Noam Razin


Event details

Date 02.09.2025
Hour 11:00 - 12:00
Speaker Noam Razin
Location Online
Category Conferences - Seminars
Event Language English

The talk is organized by the EPFL AI Center as part of the AI Fundamentals seminar series.

Talk followed by a coffee session.

Hosting professor: Prof. Nicolas Flammarion

Title
Understanding and Overcoming Pitfalls in Language Model Alignment

Abstract
Training safe and helpful language models requires aligning them with human preferences. In this talk, I will present theory and experiments highlighting pitfalls in the two most widely adopted approaches: Reinforcement Learning from Human Feedback (RLHF), which trains a reward model based on preference data and then maximizes this reward via RL, and Direct Preference Optimization (DPO), which directly trains the language model on preference data. As detailed below, beyond characterizing these pitfalls, I will provide quantitative measures for identifying when they occur and suggest preventative guidelines.
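
As a quick reference for the two approaches named above (the notation here is assumed for illustration and is not taken from the talk): RLHF maximizes a learned reward r_φ under a KL penalty that keeps the policy π_θ close to a reference model π_ref, whereas DPO trains π_θ directly on preference pairs (x, y_w, y_l), where y_w is preferred over y_l. A standard way of writing the two objectives is:

RLHF: \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

DPO: \mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]

Here σ is the logistic function and β controls the strength of the KL regularization (in RLHF) or the scale of the implicit reward (in DPO).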

First, I will show that RLHF suffers from a flat objective landscape that hinders optimization when the reward model induces low reward variance. This issue can arise even when the reward model is highly accurate, which challenges the conventional wisdom that more accurate reward models are better teachers and reveals limitations of existing reward model benchmarks. I will also present practical applications of the connection between reward variance and optimization (e.g., the design of data selection and policy gradient methods) and discuss how different reward model parameterizations affect generalization.

Then, we will focus on likelihood displacement: a counterintuitive tendency of DPO to decrease the probability of preferred outputs instead of increasing it as intended. I will characterize the mechanisms driving likelihood displacement and demonstrate that it can lead to surprising alignment failures. In particular, aligning a model to refuse answering unsafe prompts can unintentionally unalign it by shifting probability mass from preferred safe outputs to harmful ones. Our analysis yields a data filtering method that mitigates such undesirable outcomes of DPO and highlights the importance of curating data with sufficiently distinct preferences.
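
To make likelihood displacement concrete, here is a minimal numeric sketch in Python, assuming the standard DPO loss written above; all numbers are made up for illustration. The point is only that the loss depends on the margin between the preferred and dispreferred log-probability ratios, so it can keep decreasing even while the probability of the preferred output drops, provided the probability of the dispreferred output drops faster.

import math

# Implicit reward scale (hypothetical value chosen for this sketch).
beta = 0.1

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # -log sigmoid( beta * (preferred log-ratio - dispreferred log-ratio) )
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reference-model log-probabilities of the preferred (w) and dispreferred (l) outputs.
ref_w, ref_l = -10.0, -10.0

# Before training: the policy matches the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)   # ~0.693

# After training: the preferred output became *less* likely (-10.0 -> -12.0),
# yet the dispreferred output dropped much more (-10.0 -> -20.0), so the
# margin grew and the loss still went down.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)    # ~0.371

print(f"loss before: {loss_before:.3f}")
print(f"loss after:  {loss_after:.3f}")

The sketch only shows that the loss permits this behavior; the talk's analysis characterizes when and why trained models actually exhibit it, and how data curation can prevent it.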

Bio
Noam Razin is a Postdoctoral Fellow at Princeton Language and Intelligence, Princeton University. His research focuses on the fundamentals of deep learning. In particular, he aims to develop theories that shed light on how deep learning works, identify potential failures, and yield principled methods for improving efficiency, reliability, and performance.

Noam obtained his PhD in Computer Science at Tel Aviv University, where he was advised by Nadav Cohen. For his research, Noam received several honors and awards, including the Zuckerman Postdoctoral Scholarship, the Apple Scholars in AI/ML PhD fellowship, and the Tel Aviv University Center for AI and Data Science excellence fellowship.

Practical information

  • Informed public
  • Free

Organizer

  • EPFL AI Center

Contact

  • Nicolas Machado

Tags

RLHF, DPO, Reward Variance, Likelihood Displacement, Alignment Pitfalls
