What’s In My Big Data? Connecting Between Data and Models Behavior

Event details

Date	14.11.2023
Hour	11:00-12:00
Speaker	Yanai Elazar is a postdoctoral researcher on the AllenNLP team at AI2 and at the University of Washington, and a Rothschild Fellow. He completed his PhD in Computer Science (2022) in the NLP lab at Bar-Ilan University.
Location	BC 04 Online
Category	Conferences - Seminars
Event Language	English

Yanai Elazar is visiting from the University of Washington to present his most recent work: "What’s In My Big Data? Connecting Between Data and Models Behavior".

Summary of the talk:
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination).
In this work, we introduce What's In My Big Data? (WIMBD), a platform and a set of 16 high-level analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities (count and search) at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
We apply WIMBD to 10 different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE.
We then discuss follow-up research projects we carried out using WIMBD, which study how different model behaviors originate from the data.

Practical information

Informed public
Free

Organizer

Antoine Bosselut, NLP lab

Contact

Syrielle Montariol
