Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics


Swabha Swayamdipta: As datasets have grown larger, data exploration has become increasingly challenging. Our new work on Dataset Cartography, at @emnlp2020 with @royschwartz02, @NickLourie, @yizhongwyz, @HannaHajishirzi, @nlpnoah, @YejinChoinka offers a solution 🗺️ Paper: 1/n

Noah A Smith: Dataset cartography: a new way to look at your training dataset, derived from model training dynamics with respect to each instance. Forthcoming EMNLP paper by @swabhz @royschwartz02 @NickLourie @yizhongwyz @HannaHajishirzi @nlpnoah @YejinChoinka

Roy Schwartz: Training dynamics help us visualize our data and divide it into clearly distinct regions: “easy-to-learn” instances (for the model), “hard-to-learn” instances that contain many annotation errors, and “ambiguous” instances that are the highest quality samples for training. 1/2

Swabha Swayamdipta: Updated camera-ready version and code now available! Code: Paper:

Oren Etzioni: Thanks, John! Credit to @swabhz and her co-authors.

John Bohannon: Beautiful work from @allen_ai -- @etzioni just keeps it coming. Dataset Cartography: the quality of data is proving to be far more important than quantity. We will definitely try out this technique at @primer_ai as we industrialize text classification.

Chenhao Tan: Only got to catch up with conferencing a little bit tonight. I liked this Data Cartography work by @swabhz and coauthors. The issue of data quality is under-explored and deserves much more attention. This work investigates the angle of training signals.

Swabha Swayamdipta: What a cool application of Data Maps!!

Swabha Swayamdipta: Learn more about our Dataset Cartography📍work at #emnlp2020 in Q&A session 16, happening in 45 mins!

Ravi Shekhar: @EmtiyazKhan Looks very interesting. The recent "Dataset Cartography" paper by @swabhz et al. is a really simple and intuitive way to show the effect of confidence and variability on the overall training.

Oznur Tastan: For this we use the method presented in @swabhz et al.; we inspect the mean and the variation of the true class probabilities across the epochs to generate a list of possibly misannotated lncRNAs. More is in the preprint. Feedback is appreciated.
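
The procedure described in that tweet (taking the mean and the variation of the true-class probability across epochs) might look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' released code: the function names, the toy probabilities, and both thresholds are hypothetical, and variability is taken here to be the population standard deviation.

```python
import numpy as np

def training_dynamics(true_class_probs):
    """Summarize per-instance training dynamics.

    true_class_probs: array-like of shape (n_epochs, n_instances); entry
    [e, i] is the probability the model gave to the gold label of
    instance i at the end of epoch e.
    """
    probs = np.asarray(true_class_probs, dtype=float)
    confidence = probs.mean(axis=0)   # mean over epochs
    variability = probs.std(axis=0)   # spread over epochs (population std)
    return confidence, variability

def flag_possible_label_errors(confidence, variability,
                               max_confidence=0.3, max_variability=0.2):
    """Return indices of instances that stay low-confidence throughout training.

    Both thresholds are illustrative assumptions and would need tuning
    for any real dataset.
    """
    mask = (confidence < max_confidence) & (variability < max_variability)
    return np.flatnonzero(mask)

# Toy data: 4 epochs (rows) x 3 instances (columns), invented for illustration.
probs = [
    [0.60, 0.10, 0.20],
    [0.80, 0.10, 0.70],
    [0.90, 0.20, 0.30],
    [0.95, 0.10, 0.80],
]
confidence, variability = training_dynamics(probs)
suspects = flag_possible_label_errors(confidence, variability)
print(suspects)
```

In this toy run, the first instance has high confidence (the "easy-to-learn" pattern), the third swings widely between epochs (the "ambiguous" pattern), and only the second stays consistently low, so it is the one flagged for manual review as possibly misannotated.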

HotComputerScience: Most popular computer science paper of the day: "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics"

arXiv CS-CL: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

