On the role of Weight Decay in modern Deep Learning

Event details

Date 05.09.2023
Hour 14:00 - 16:00
Speaker Francesco D'Angelo
Location
Category Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Martin Jaggi
Thesis advisor: Prof. Nicolas Flammarion
Co-examiner: Prof. Florent Krzakala

Abstract
This manuscript presents recent insights into the
role of Weight Decay in modern deep learning. The experiments
in the work of Zhang et al. [1] challenge the traditional
view of Weight Decay as a capacity constraint and question the
necessity of explicit regularization for good generalization. This
prompts a need to explore Weight Decay’s impact on training
dynamics and optimization. Li and Arora [2] show that under
Batch Normalization, there exists an equivalence between the
trajectory in function space of SGD with Weight Decay and
that of SGD with an exponentially increasing learning rate. This
result challenges the conventional understanding of optimization
and highlights the confounding effects that normalization layers
can introduce. Finally, the work of Li et al. [3]
introduces an SDE framework in which the interaction between
learning rate schedules, Weight Decay and Batch Normalization
is jointly studied. Their analysis unveils the existence of an intrinsic
learning rate parameter, which controls the speed of learning
and the equilibrium distribution in function space. This defies the
widespread belief that large initial learning rates are essential for
good generalization. We conclude the manuscript with a proposal
describing the next steps towards explaining the role of Weight
Decay in present-day deep learning.
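
To make the equivalence result of [2] concrete, here is a sketch of the argument in the simplest setting: plain gradient descent on a scale-invariant loss (the invariance that Batch Normalization induces with respect to the preceding layer's weights). This is a reconstruction under those simplifying assumptions, not the paper's full statement.

```latex
% Sketch, assuming a scale-invariant loss L: L(cw) = L(w) for all c > 0,
% which implies \nabla L(cw) = \nabla L(w) / c. One gradient descent step
% with learning rate \eta and Weight Decay \lambda reads
\[
  w_{t+1} = (1 - \eta \lambda)\, w_t - \eta\, \nabla L(w_t).
\]
% Rescaling the iterates as \tilde{w}_t = (1 - \eta \lambda)^{-t} w_t and
% applying the gradient identity above gives
\[
  \tilde{w}_{t+1}
    = \tilde{w}_t - \eta\, (1 - \eta \lambda)^{-(2t+1)}\, \nabla L(\tilde{w}_t),
\]
% i.e. gradient descent without Weight Decay under an exponentially
% increasing learning rate. Since L is scale-invariant, w_t and \tilde{w}_t
% define the same function, so the two trajectories coincide in function space.
```

The identity is easy to verify numerically. The following is a minimal sketch assuming PyTorch; the toy scale-invariant objective and every name in it are illustrative choices, not taken from [2].

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

d = 10
x = torch.randn(d)

def loss(w):
    # Toy scale-invariant objective: depends on w only through its
    # direction w / ||w||, mimicking the invariance induced by
    # Batch Normalization.
    u = w / w.norm()
    return -torch.nn.functional.logsigmoid(u @ x)

def grad(w):
    w = w.detach().clone().requires_grad_(True)
    loss(w).backward()
    return w.grad

eta, lam, T = 0.1, 0.01, 50
w0 = torch.randn(d)

# Run A: gradient descent with Weight Decay lam and constant learning rate eta.
w = w0.clone()
for t in range(T):
    w = (1 - eta * lam) * w - eta * grad(w)

# Run B: no Weight Decay, exponentially increasing learning rate
# eta_t = eta * (1 - eta * lam)^(-(2t + 1)).
v = w0.clone()
for t in range(T):
    v = v - eta * (1 - eta * lam) ** (-(2 * t + 1)) * grad(v)

# The runs coincide in function space: same direction, hence the same
# predictions and the same loss, up to floating-point error.
print(torch.allclose(w / w.norm(), v / v.norm()))  # -> True
print(loss(w).item(), loss(v).item())
```

Note, for reference, that in our reading of [3] the intrinsic learning rate mentioned above is the product \lambda_e = \eta \lambda of learning rate and Weight Decay coefficient, which is also the quantity entering the rescaling factor in the sketch.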

Background papers

[1] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
[2] Z. Li and S. Arora. An exponential learning rate schedule for deep learning. ICLR, 2020.
[3] Z. Li, K. Lyu, and S. Arora. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. NeurIPS, 2020.

Practical information

  • General public
  • Free

Tags

EDIC candidacy exam
