BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Memento EPFL//
BEGIN:VEVENT
SUMMARY:Fuel and Interpreter: Data Curriculum Design for Foundation Models
DTSTART:20230831T100000
DTEND:20230831T120000
DTSTAMP:20260408T060400Z
UID:09d4fd842e3affd38068b8ba66227004e3afeab83bb31a1bd70ee889
CATEGORIES:Conferences - Seminars
DESCRIPTION:Simin Fan\nEDIC candidacy exam\nExam president: Prof. Antoine 
 Bosselut\nThesis advisor: Prof. Martin Jaggi\nCo-examiner: Prof. Robert We
 st\n\nAbstract\nAs neural networks grow in scale\, their demand for traini
 ng data inflates correspondingly. However\, amid the impressive training g
 ains from large web-crawled datasets\, the fundamental causal role of the 
 inherent features of input samples is often overlooked. We believe a data-
 side curriculum built on the inherent quality of samples would serve not o
 nly as fuel to accelerate training\, but also as a post-hoc interpreter th
 at provides an informative assessment of large foundation models. In this 
 research proposal\, we analyze three papers on data quality assessment and
  data curriculum design for large-model pre-training. We further show the 
 potential of using feedback from the model to optimize the data selection 
 criterion. In the last section\, we describe insights and plans for future
  exploration of data-side curriculum design in the era of large foundation
  models.\n\nBackground papers\n1) DoReMi: Optimizing Data Mixtures Speeds 
 Up Language Model Pretraining https://arxiv.org/abs/2305.10429\n2) Priorit
 ized Training on Points that are Learnable\, Worth Learning\, and not yet 
 Learnt https://proceedings.mlr.press/v162/mindermann22a.html\n3) Skill-it!
  A Data-Driven Skills Framework for Understanding and Training Language Mo
 dels https://arxiv.org/abs/2307.14430
LOCATION:
STATUS:CONFIRMED
END:VEVENT
END:VCALENDAR
