Fuel and Interpreter: Data Curriculum Design for Foundation Models

Event details

Date 31.08.2023
Hour 10:00 – 12:00
Speaker Simin Fan
Category Conferences - Seminars
EDIC candidacy exam
Exam president: Prof. Antoine Bosselut
Thesis advisor: Prof. Martin Jaggi
Co-examiner: Prof. Robert West

Abstract
As neural networks grow in scale, their demand for training data inflates at a comparable pace. Yet, impressed by the training gains obtained from massive web-crawled datasets, the community tends to overlook how the inherent properties of individual input samples causally shape what models learn. We believe a data-side curriculum built on this inherent quality would serve not only as fuel to accelerate the training process, but also as a post-hoc interpreter that provides an informative assessment of large foundation models. In this research proposal, we analyze three papers on data quality assessment and data curriculum design for large-scale model pre-training. We further show the potential of using feedback from the model to optimize the data selection criterion. In the last part, we describe our insights and plans for future exploration of data-side curriculum design in the era of large foundation models.

Background papers
1) DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. https://arxiv.org/abs/2305.10429
2) Prioritized Training on Points that are Learnable, Worth Learning, and not yet Learnt. https://proceedings.mlr.press/v162/mindermann22a.html
3) Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models. https://arxiv.org/abs/2307.14430
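
To give a concrete flavor of "using feedback from the model to optimize the data selection criterion", below is a minimal, illustrative Python/PyTorch sketch: each training step scores a large candidate batch with the current model and keeps only the highest-loss examples. This is a deliberately simplified stand-in, not the exact method of any of the papers above (e.g., the prioritized-training paper scores points by reducible holdout loss rather than raw training loss, and DoReMi reweights whole domains rather than individual examples). The helper name select_high_loss_examples and the toy data are hypothetical.

import torch
import torch.nn.functional as F


def select_high_loss_examples(model, inputs, targets, keep_fraction=0.25):
    # Score the candidate batch with the current model (no gradients needed)
    # and keep only the fraction of examples with the highest loss.
    with torch.no_grad():
        logits = model(inputs)
        per_example_loss = F.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(keep_fraction * inputs.size(0)))
    top_idx = per_example_loss.topk(k).indices
    return inputs[top_idx], targets[top_idx]


if __name__ == "__main__":
    # Toy setup: a linear classifier on random data, purely to show the loop.
    model = torch.nn.Linear(16, 4)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(10):
        # Draw a large candidate batch, then narrow it down by model feedback.
        x = torch.randn(64, 16)
        y = torch.randint(0, 4, (64,))
        x_sel, y_sel = select_high_loss_examples(model, x, y, keep_fraction=0.25)
        loss = F.cross_entropy(model(x_sel), y_sel)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()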
 

Practical information

  • General public
  • Free

Tags

EDIC candidacy exam
