BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Memento EPFL//
BEGIN:VEVENT
SUMMARY:What’s In My Big Data? Connecting Between Data and Models Behavi
 or
DTSTART:20231114T110000
DTEND:20231114T120000
DTSTAMP:20260609T084011Z
UID:0ebc8d5968a921e3dcc30eb2de8f1d7a091c01f743d224f4ae68ddde
CATEGORIES:Conferences - Seminars
DESCRIPTION:Yanai Elazar is a postdoctoral researcher on the AllenNLP t
 eam at AI2 and the University of Washington\, and a Rothschild Fellow. H
 e completed his PhD in Computer Science (2022) in the NLP lab at Bar-Il
 an University.\nYanai Elazar is visiting from the University of Washingt
 on to present his most recent work: "What’s In My Big Data? Connec
 ting Betw
 een Data and Models Behavior".\n\nSummary of the talk: \nLarge text corpo
 ra are the backbone of language models. However\, we have a limited unders
 tanding of the content of these corpora\, including general statistics\, q
 uality\, social factors\, and inclusion of evaluation data (contamination)
 .\nIn this work\, we introduce What's In My Big Data? (WIMBD)\, a platform
  and a set of 16 high-level analyses that allow us to reveal and compare t
 he contents of large text corpora. WIMBD builds on two basic capabilities
  (count and search) at scale\, which allows us to analyze more than 35 te
 rabytes on a standard compute node.\nWe apply WIMBD to 10 different corpor
 a used to train popular language models\, including C4\, The Pile\, and Re
 dPajama. Our analysis uncovers several surprising and previously undocumen
 ted findings about these corpora\, including the high prevalence of duplic
 ate\, synthetic\, and low-quality content\, personally identifiable inform
 ation\, toxic language\, and benchmark contamination. For instance\, we fi
 nd that about 50% of the documents in RedPajama and LAION-2B-en are duplic
 ates. In addition\, several datasets used for benchmarking models trained 
 on such corpora are contaminated with respect to important benchmarks\, in
 cluding the Winograd Schema Challenge and parts of GLUE and SuperGLUE.\nWe
  then discuss follow-up research projects we did using WIMBD\, studying di
 fferent model behaviors originating from the data.
LOCATION:BC 04 https://plan.epfl.ch/?room==BC%2004 https://epfl.zoom.us/j/
 69499602273?pwd=WTBWK1o0L1Z0b085anBiM094STFjQT09
STATUS:CONFIRMED
END:VEVENT
END:VCALENDAR