Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Comments

Swabha Swayamdipta: As datasets have grown larger, data exploration has become increasingly challenging. Our new work on Dataset Cartography, at @emnlp2020 with @royschwartz02, @NickLourie, @yizhongwyz, @HannaHajishirzi, @nlpnoah, @YejinChoinka offers a solution 🗺️ Paper: http://arxiv.org/abs/2009.10795 1/n https://t.co/1hItp5yOx2

2 replies, 256 likes


Noah A Smith: Dataset cartography: a new way to look at your training dataset, derived from model training dynamics with respect to each instance. Forthcoming EMNLP paper by @swabhz @royschwartz02 @NickLourie @yizhongwyz @HannaHajishirzi @nlpnoah @YejinChoinka https://arxiv.org/abs/2009.10795

4 replies, 185 likes


Roy Schwartz: Training dynamics help us visualize our data and divide it into clearly distinct areas: some instances are “easy-to-learn” (for the model), “hard-to-learn” instances contain many annotation errors, and “ambiguous” instances are the highest-quality samples for training. 1/2 https://t.co/MQEmVJ6ANL

3 replies, 118 likes
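
The three areas named in the tweet above can be sketched as a simple rule over two per-instance statistics, confidence and variability (the mean and spread of the model's probability for the gold label across training epochs). This is only an illustrative sketch: the function name `assign_region`, the cut-off values `conf_cut` and `var_cut`, and the sample points are assumptions made for the example, not values taken from the paper or its code.

```python
def assign_region(confidence, variability, conf_cut=0.5, var_cut=0.2):
    """Label one instance by where it falls on the data map.

    conf_cut and var_cut are hypothetical thresholds picked for
    illustration; they are not taken from the paper.
    """
    if variability >= var_cut:
        # The model's prediction for the gold label keeps changing.
        return "ambiguous"
    if confidence >= conf_cut:
        # Consistently high probability on the gold label.
        return "easy-to-learn"
    # Consistently low probability on the gold label.
    return "hard-to-learn"


# Made-up (confidence, variability) pairs for three instances.
points = {"a": (0.95, 0.03), "b": (0.10, 0.05), "c": (0.55, 0.35)}
regions = {name: assign_region(c, v) for name, (c, v) in points.items()}
print(regions)
```

In practice the boundaries between areas are a judgment call; a scatter plot of variability against confidence is usually inspected first, and cut-offs chosen afterwards.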


Swabha Swayamdipta: Updated camera-ready version and code now available! Code: https://github.com/allenai/cartography Paper: https://arxiv.org/abs/2009.10795

0 replies, 30 likes


Oren Etzioni: Thanks, John! Credit to @swabhz and her co-authors.

0 replies, 23 likes


John Bohannon: Beautiful work from @allen_ai -- @etzioni just keeps it coming. Dataset Cartography https://arxiv.org/abs/2009.10795 The quality of data is proving to be far more important than quantity. We will definitely try out this technique at @primer_ai as we industrialize text classification. https://t.co/BTreT6zmxh

0 replies, 22 likes


Chenhao Tan: Only got to catch up with conferencing a little bit tonight, I liked this Data Cartography work by @swabhz and coauthors: https://twitter.com/swabhz/status/1309217889568854016 The issue of data quality is under-explored and deserves much more attention. This work investigates the angle of training signals.

1 reply, 9 likes


Swabha Swayamdipta: What a cool application of Data Maps!!

0 replies, 9 likes


Swabha Swayamdipta: Learn more about our Dataset Cartography📍work at #emnlp2020 in Q&A session 16, happening in 45 mins!

0 replies, 8 likes


Ravi Shekhar: @EmtiyazKhan Looks very interesting. The recent "Dataset Cartography" paper by @swabhz et al. is a really simple and intuitive way to show the effect of confidence and variability on the overall training https://arxiv.org/abs/2009.10795

1 reply, 5 likes


Oznur Tastan: For this we use the method presented in @swabhz et al. https://arxiv.org/abs/2009.10795; we inspect the mean and the variation of the true class probabilities across the epochs to generate a list of possibly misannotated lncRNAs. More is in the preprint. Feedback is appreciated.

0 replies, 4 likes
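
The inspection described in the comment above (mean and variation of the true-class probability across epochs, used to list possibly misannotated examples) might look roughly like the following. This is a hedged sketch, not the pipeline from the preprint: the helper names `training_stats` and `flag_candidates`, the thresholds `max_mean` and `max_spread`, the use of the population standard deviation as the measure of variation, and the sample probabilities are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def training_stats(true_class_probs):
    """Return (mean, spread) of the true-class probability across epochs.

    Spread is the population standard deviation (an assumption here).
    """
    return mean(true_class_probs), pstdev(true_class_probs)


def flag_candidates(probs_by_instance, max_mean=0.3, max_spread=0.15):
    """List instances whose true-class probability stays low across epochs.

    max_mean and max_spread are hypothetical thresholds for illustration.
    Results are sorted so the lowest mean comes first.
    """
    flagged = []
    for instance_id, probs in probs_by_instance.items():
        mu, sigma = training_stats(probs)
        if mu <= max_mean and sigma <= max_spread:
            flagged.append((instance_id, mu, sigma))
    return sorted(flagged, key=lambda row: row[1])


# Made-up per-epoch probabilities assigned to the annotated class.
probs_by_instance = {
    "seq_1": [0.60, 0.85, 0.95, 0.97],  # learned quickly
    "seq_2": [0.10, 0.08, 0.12, 0.09],  # never fits the given label
    "seq_3": [0.20, 0.70, 0.30, 0.80],  # fluctuates between epochs
}
candidates = flag_candidates(probs_by_instance)
print([instance_id for instance_id, _, _ in candidates])
```

A flagged instance is only a candidate for manual review: a consistently low probability can also mean the example is genuinely difficult rather than mislabeled.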


HotComputerScience: Most popular computer science paper of the day: "Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics" https://hotcomputerscience.com/paper/dataset-cartography-mapping-and-diagnosing-datasets-with-training-dynamics https://twitter.com/swabhz/status/1309217889568854016

0 replies, 3 likes


arXiv CS-CL: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics http://arxiv.org/abs/2009.10795

0 replies, 1 like


Content

Found on Sep 24 2020 at https://arxiv.org/pdf/2009.10795.pdf

PDF content of a computer science paper: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics