Analyzing and Improving the Robustness of Deep Learning Models via Mechanistic Interpretability.

Event details
- Date: 26.08.2025
- Time: 13:00 – 15:00
- Speaker: Amel Abdelraheem
- Category: Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Martin Jaggi
Thesis advisor: Prof. Pascal Frossard
Co-examiner: Prof. Patrick Thiran
Abstract
Pre-trained models are increasingly used as foundations for downstream tasks, which makes their robustness, safety, and reliability all the more important. Recent literature highlights a striking property: deep neural networks exhibit an underlying linear structure. Phenomena such as linear mode connectivity show that independently trained models can be joined by low-loss paths in weight space, and these paths can be exploited to study and improve adversarial robustness. More recently, linear mode connectivity has been linked to the distinct internal mechanisms that models use to make predictions, helping to explain why fine-tuning alone may fail to remove spurious correlations. Building on this insight, model editing techniques, specifically task arithmetic, demonstrate that traversing directions in weight space can edit a model's behavior, strengthening or suppressing certain capabilities without full retraining and thereby enhancing robustness. Taken together, these results encourage examining pre-trained networks through a mechanistic lens, providing concrete tools for analyzing and steering deep learning models.
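To make the two weight-space operations mentioned above concrete, here is a minimal sketch in PyTorch: linear interpolation between two sets of weights (the path studied by linear mode connectivity) and task arithmetic (adding or subtracting the direction that fine-tuning moved the model). This is an illustrative sketch, not code from the talk; the toy network, the perturbation standing in for fine-tuning, and the scaling values are assumptions.

```python
import torch

def interpolate(state_a, state_b, alpha):
    """Point on the linear path (1 - alpha) * theta_A + alpha * theta_B.

    Linear mode connectivity holds when the loss stays low for every
    alpha in [0, 1] along this path.
    """
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k]
            for k in state_a}

def task_vector(pretrained, finetuned):
    """Direction in weight space traversed by fine-tuning: tau = theta_ft - theta_pre."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vector(pretrained, tau, scale):
    """Edit the model along tau: scale > 0 strengthens the capability, scale < 0 suppresses it."""
    return {k: pretrained[k] + scale * tau[k] for k in pretrained}

# Toy usage with a small network; any nn.Module works the same way.
net = torch.nn.Sequential(torch.nn.Linear(8, 16),
                          torch.nn.ReLU(),
                          torch.nn.Linear(16, 2))
theta_pre = {k: v.clone() for k, v in net.state_dict().items()}
# Stand-in for fine-tuning: perturb the pre-trained weights slightly.
theta_ft = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_pre.items()}

tau = task_vector(theta_pre, theta_ft)
net.load_state_dict(apply_task_vector(theta_pre, tau, scale=-1.0))  # negate tau to suppress
net.load_state_dict(interpolate(theta_pre, theta_ft, alpha=0.5))    # midpoint of the linear path
```

Sweeping alpha over [0, 1] and recording the loss at each point is the standard way to test whether two solutions are linearly mode connected; no retraining is involved in either operation.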
Selected papers
Practical information
- General public
- Free
Contact
- edic@epfl.ch