External FLAIR seminar: Yuhai Tu

Event details

Date 21.10.2022
Hour 13:15 – 14:15
Speaker Yuhai Tu
Location
Category Conferences - Seminars
Event Language English
Title: Can physicists help understand Deep Learning?
Speaker: Yuhai Tu (IBM T. J. Watson Research Center)

Abstract: Despite the great success of deep learning, it remains largely a black box. In this seminar, we will describe our recent work in understanding learning dynamics and generalization of deep neural networks based on concepts and tools from statistical physics. 
 
(1) SGD learning dynamics: The main search engine in deep neural networks is the Stochastic Gradient Descent (SGD) algorithm; however, little is known about how SGD finds "good" solutions (those with low generalization error) in the high-dimensional weight space. By studying weight fluctuations in SGD, we find a robust inverse relation between the weight variance in SGD and the flatness of the loss landscape, which is the opposite of the fluctuation-dissipation (response) relation in equilibrium statistical physics. We show that the noise strength in SGD depends inversely on the landscape flatness, which explains the inverse variance-flatness relation. Our study suggests that SGD serves as an "intelligent" annealing strategy in which the effective temperature self-adjusts according to the loss landscape, allowing it to find the flat-minimum regions that contain generalizable solutions. Finally, we discuss how these insights can be applied to efficiently reduce catastrophic forgetting when learning multiple tasks sequentially [1].
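To make the variance-flatness picture above more concrete, here is a minimal, self-contained sketch (in PyTorch; the toy data, the tiny model, and the simple flatness proxy are assumptions for illustration only, not the experimental setup of [1]) of how one might record SGD weight fluctuations around a low-loss solution and compare them with the flatness of the full-batch loss along the same principal directions of the trajectory:

```python
# Minimal illustrative sketch (assumptions: toy data, tiny model, simple flatness
# proxy; this is NOT the experimental setup of [1]).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classification data and a small fully connected network.
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def flat_params():
    # Concatenate all parameters into one flat vector (copied by torch.cat).
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def sgd_step():
    idx = torch.randint(0, X.shape[0], (32,))
    opt.zero_grad()
    loss_fn(model(X[idx]), y[idx]).backward()
    opt.step()

# Phase 1: train into a low-loss region.
for _ in range(2000):
    sgd_step()

# Phase 2: keep running SGD and record the weight trajectory (the fluctuations).
snapshots = []
for _ in range(500):
    sgd_step()
    snapshots.append(flat_params())
W = torch.stack(snapshots)              # shape: (steps, n_params)
center = W.mean(dim=0)

# Principal fluctuation directions of the SGD trajectory (PCA of the snapshots).
_, S, Vh = torch.linalg.svd(W - center, full_matrices=False)
variances = S ** 2 / (W.shape[0] - 1)   # SGD weight variance along each direction

def full_loss(theta):
    # Load a flat parameter vector into the model and return the full-batch loss.
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(theta[offset:offset + n].reshape(p.shape))
            offset += n
        return loss_fn(model(X), y).item()

def flatness(direction, threshold=0.1, max_t=5.0, n_grid=50):
    # Crude flatness proxy: how far one can move from the center along +/- direction
    # before the full-batch loss rises by more than `threshold` above its value there.
    base = full_loss(center)
    width = 0.0
    for sign in (1.0, -1.0):
        reach = 0.0
        for t in torch.linspace(0.0, max_t, n_grid):
            if full_loss(center + sign * float(t) * direction) > base + threshold:
                break
            reach = float(t)
        width += reach
    full_loss(center)  # restore the model parameters to the trajectory center
    return width

for k in range(5):
    print(f"direction {k}: SGD variance = {variances[k].item():.3e}, "
          f"flatness proxy = {flatness(Vh[k]):.2f}")
```

If the inverse variance-flatness relation holds in this toy setting, the directions along which the SGD trajectory fluctuates most should correspond to the flattest directions of the loss landscape.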
 
(2) Geometric determinants of generalization: We first report the discovery of duality relations between changes in the activities of a densely connected layer of neurons and changes in the weights connecting that layer to the next. This activity-weight duality leads to an explicit expression for the generalization loss, which can be decomposed into contributions from different directions in weight space. We find that the generalization loss from each direction is the product of two geometric factors (determinants): the sharpness of the loss landscape at the solution and the standard deviation of the dual weights, which scales as an activity-weighted norm of the solution. Using this decomposition of the generalization loss, we uncover how the hyperparameters of SGD, different regularization schemes (e.g., weight decay and dropout), training data size, and labeling noise affect generalization by controlling one or both of these factors [2].
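Of the two geometric factors named above, the sharpness of the loss landscape at the solution is the easier one to probe directly. The following hedged sketch (PyTorch; the toy data and model are assumptions) estimates a standard sharpness proxy, the top Hessian eigenvalue of the training loss at a trained solution, via Hessian-vector products and power iteration; the dual-weight factor requires the activity-weight duality construction of [2] and is not reproduced here:

```python
# Hedged sketch (assumptions: toy data and model; sharpness measured as the top
# Hessian eigenvalue of the full-batch training loss at the trained solution).
import torch
import torch.nn as nn

torch.manual_seed(0)

X = torch.randn(256, 10)
y = (X.sum(dim=1) > 0).long()
model = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 2))
loss_fn = nn.CrossEntropyLoss()

# Train briefly so that we sit near a minimum of the training loss.
opt = torch.optim.SGD(model.parameters(), lr=0.2)
for _ in range(1000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

params = [p for p in model.parameters() if p.requires_grad]

def hvp(vecs):
    # Hessian-vector product of the full-batch loss, via double backpropagation.
    loss = loss_fn(model(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vecs))
    return [h.detach() for h in torch.autograd.grad(dot, params)]

# Power iteration for the top Hessian eigenvalue (a common sharpness proxy).
v = [torch.randn_like(p) for p in params]
eig = 0.0
for _ in range(50):
    norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / norm for x in v]
    hv = hvp(v)
    eig = sum((a * b).sum() for a, b in zip(v, hv)).item()  # Rayleigh quotient
    v = hv

print(f"Sharpness proxy (top Hessian eigenvalue): {eig:.4f}")
```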
 
 
[1] “The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima”, Y. Feng and Y. Tu, PNAS 118 (9), 2021.
 
[2] “The activity-weight duality in feed forward neural networks: The geometric determinants of generalization”, Y. Feng and Y. Tu, https://arxiv.org/abs/2203.10736

Practical information

  • Informed public
  • Free
