AI Center Seminar - AI Fundamentals series - Dr. Noam Razin


Event details

Date 02.09.2025
Hour 11:00 - 12:00
Speaker Noam Razin
Location Online
Category Conferences - Seminars
Event Language English

The talk is organized by the EPFL AI Center as part of the AI Fundamentals seminar series.

Talk followed by a coffee session.

Hosting professor: Prof. Nicolas Flammarion

Title
Understanding and Overcoming Pitfalls in Language Model Alignment

Abstract
Training safe and helpful language models requires aligning them with human preferences. In this talk, I will present theory and experiments highlighting pitfalls in the two most widely adopted approaches: Reinforcement Learning from Human Feedback (RLHF), which trains a reward model based on preference data and then maximizes this reward via RL, and Direct Preference Optimization (DPO), which directly trains the language model on preference data. As detailed below, beyond characterizing these pitfalls, I will provide quantitative measures for identifying when they occur and suggest preventative guidelines.
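
As a quick reference for the two approaches named above (the notation here is assumed for illustration and is not taken from the talk): RLHF maximizes a learned reward r_φ under a KL penalty that keeps the policy π_θ close to a reference model π_ref, whereas DPO trains π_θ directly on preference pairs (x, y_w, y_l), where y_w is preferred over y_l. A standard way of writing the two objectives is:

RLHF: \max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

DPO: \mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \log \sigma \Big( \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]

Here σ is the logistic function and β controls the strength of the KL regularization (in RLHF) or the scale of the implicit reward (in DPO).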

First, I will show that RLHF suffers from a flat objective landscape that hinders optimization when the reward model induces low reward variance. This issue can arise even when the reward model is highly accurate, which challenges the conventional wisdom that more accurate reward models are better teachers and reveals limitations of existing reward model benchmarks. I will also present practical applications of the connection between reward variance and optimization (e.g., the design of data selection and policy gradient methods) and discuss how different reward model parameterizations affect generalization.

Then, we will focus on likelihood displacement: a counterintuitive tendency of DPO to decrease the probability of preferred outputs instead of increasing it as intended. I will characterize the mechanisms driving likelihood displacement and demonstrate that it can lead to surprising alignment failures. In particular, aligning a model to refuse answering unsafe prompts can unintentionally unalign it by shifting probability mass from preferred safe outputs to harmful ones. Our analysis yields a data filtering method that mitigates such undesirable outcomes of DPO and highlights the importance of curating data with sufficiently distinct preferences.
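
To make likelihood displacement concrete, here is a minimal numeric sketch in Python, assuming the standard DPO loss written above; all numbers are made up for illustration. The point is only that the loss depends on the margin between the preferred and dispreferred log-probability ratios, so it can keep decreasing even while the probability of the preferred output drops, provided the probability of the dispreferred output drops faster.

import math

# Implicit reward scale (hypothetical value chosen for this sketch).
beta = 0.1

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # -log sigmoid( beta * (preferred log-ratio - dispreferred log-ratio) )
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reference-model log-probabilities of the preferred (w) and dispreferred (l) outputs.
ref_w, ref_l = -10.0, -10.0

# Before training: the policy matches the reference, so the margin is zero.
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)   # ~0.693

# After training: the preferred output became *less* likely (-10.0 -> -12.0),
# yet the dispreferred output dropped much more (-10.0 -> -20.0), so the
# margin grew and the loss still went down.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)    # ~0.371

print(f"loss before: {loss_before:.3f}")
print(f"loss after:  {loss_after:.3f}")

The sketch only shows that the loss permits this behavior; the talk's analysis characterizes when and why trained models actually exhibit it, and how data curation can prevent it.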

Bio
Noam Razin is a Postdoctoral Fellow at Princeton Language and Intelligence, Princeton University. His research focuses on the fundamentals of deep learning. In particular, he aims to develop theories that shed light on how deep learning works, identify potential failures, and yield principled methods for improving efficiency, reliability, and performance.

Noam obtained his PhD in Computer Science at Tel Aviv University, where he was advised by Nadav Cohen. For his research, Noam received several honors and awards, including the Zuckerman Postdoctoral Scholarship, the Apple Scholars in AI/ML PhD fellowship, and the Tel Aviv University Center for AI and Data Science excellence fellowship.

Practical information

  • Informed public
  • Free

Organizer

  • EPFL AI Center

Contact

  • Nicolas Machado

Tags

RLHF, DPO, Reward Variance, Likelihood Displacement, Alignment Pitfalls
