AI Center Seminar - AI Fundamentals series - Dr. Noam Razin

Event details
Date: 02.09.2025
Time: 11:00–12:00
Speaker: Noam Razin
Location: Online
Category: Conferences - Seminars
Event Language: English
The talk is organized by the EPFL AI Center as part of the AI Fundamentals seminar series.
The talk will be followed by a coffee session.
Hosting professor: Prof. Nicolas Flammarion
Title
Understanding and Overcoming Pitfalls in Language Model Alignment
Abstract
Training safe and helpful language models requires aligning them with human preferences. In this talk, I will present theory and experiments highlighting pitfalls in the two most widely adopted approaches: Reinforcement Learning from Human Feedback (RLHF), which trains a reward model based on preference data and then maximizes this reward via RL, and Direct Preference Optimization (DPO), which directly trains the language model on preference data. As detailed below, beyond characterizing these pitfalls, I will provide quantitative measures for identifying when they occur and suggest preventative guidelines.
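For readers unfamiliar with these methods, a minimal sketch of the standard objectives follows (using generic notation; x is a prompt, y_w and y_l are the preferred and dispreferred responses, r_\phi is the reward model, \pi_\theta the language model, \pi_{\mathrm{ref}} a reference model, \beta a regularization coefficient, and \sigma the logistic sigmoid; the talk's exact setup may differ). In RLHF, the reward model is typically fit with a Bradley-Terry loss and the policy then maximizes a KL-regularized reward:
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big],
\max_{\pi_\theta}\; \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).
DPO instead optimizes the language model directly on preference pairs:
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big].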
First, I will show that RLHF suffers from a flat objective landscape that hinders optimization when the reward model induces low reward variance. This issue can arise even if the reward model is highly accurate, challenging the conventional wisdom that more accurate reward models are better teachers and revealing limitations of existing reward model benchmarks. Furthermore, I will present practical applications of the connection between reward variance and optimization (e.g., the design of data selection and policy gradient methods) and discuss how different reward model parameterizations affect generalization.
Then, I will focus on likelihood displacement, a counterintuitive tendency of DPO to decrease the probability of preferred outputs instead of increasing it as intended. I will characterize the mechanisms driving likelihood displacement and demonstrate that it can lead to surprising failures in alignment. In particular, aligning a model to refuse answering unsafe prompts can unintentionally unalign it by shifting probability mass from preferred safe outputs to harmful ones. Our analysis yields a data filtering method that mitigates such undesirable outcomes of DPO and highlights the importance of curating data with sufficiently distinct preferences.
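As a rough guide to the two quantities mentioned above (standard notation as in the sketch after the first abstract paragraph, not necessarily the talk's exact definitions): the reward variance in question is the variance of the reward model's scores over outputs sampled from the policy, and likelihood displacement refers to the preferred output's log-probability decreasing over the course of DPO training,
\mathrm{Var}_{y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \approx 0 \;\;\text{(roughly, a near-flat RLHF objective around } \pi_\theta\text{)},
\log \pi_{\theta_t}(y_w \mid x) \;<\; \log \pi_{\theta_0}(y_w \mid x) \quad \text{for some DPO training step } t,
where \theta_0 denotes the initial model and \theta_t the model after t steps of training.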
Bio
Noam Razin is a Postdoctoral Fellow at Princeton Language and Intelligence, Princeton University. His research focuses on the fundamentals of deep learning. In particular, he aims to develop theories that shed light on how deep learning works, identify potential failures, and yield principled methods for improving efficiency, reliability, and performance.
Noam obtained his PhD in Computer Science at Tel Aviv University, where he was advised by Nadav Cohen. For his research, Noam received several honors and awards, including the Zuckerman Postdoctoral Scholarship, the Apple Scholars in AI/ML PhD fellowship, and the Tel Aviv University Center for AI and Data Science excellence fellowship.
Practical information
- Informed public
- Free
Organizer
- EPFL AI Center
Contact
- Nicolas Machado