Towards Improving the Pretraining of Large Language Models
Event details
Date: 31.10.2024
Hour: 09:00 – 11:00
Speaker: Zhengqing Wu
Category: Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Martin Jaggi
Thesis advisor: Prof. Volkan Cevher
Co-examiner: Prof. Nicolas Flammarion
Abstract
Training large language models requires co-optimizing numerous hyperparameters (model size, learning rate, batch size, etc.) for multiple goals (training speed, compute efficiency, generalization performance, etc.), which makes the task complicated. Tackling this difficulty requires (1) understanding how each hyperparameter affects training, (2) co-optimizing the different hyperparameters, and (3) striking a balance between the different goals. In this talk, I will present three papers that discuss how these can be achieved.
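As a concrete illustration of point (1), the first background paper below reports empirical scaling laws for pretraining loss; a rough sketch of their form (fitted constants omitted) is

L(N) ≈ (N_c / N)^{α_N},   L(D) ≈ (D_c / D)^{α_D},

where N is the number of non-embedding model parameters, D the size of the training dataset in tokens, and α_N, α_D, N_c, D_c are constants fitted to observed training runs.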
Background papers
[1] Scaling Laws for Neural Language Models, https://arxiv.org/abs/2001.08361
[2] Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer, https://proceedings.neurips.cc/paper/2021/hash/8df7c2e3c3c3be098ef7b382bd2c37ba-Abstract.html
[3] On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, https://openreview.net/forum?id=H1oyRlYgg&noteId=H1oyRlYgg
Practical information
- General public
- Free