Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling

Comments

Jeremy Howard: Interesting paper showing how dozens of studies have accidentally leaked large amounts of data from train->test dataset, by duplicating data items prior to doing a random split.

21 replies, 525 likes


Gilles Vandewiele: Our paper "Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling" has been published on arXiv: https://arxiv.org/abs/2001.06296 What did we do? A thread... (1/6)

13 replies, 431 likes


Gilles Vandewiele: After discovering this, we were able to reproduce their results, but only when making a fundamental methodological flaw: applying over-sampling before partitioning data into training and testing set. (4/6)

4 replies, 37 likes
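
The flaw described in the thread can be illustrated with a small, standard-library-only sketch. It is not the paper's code: the toy data, the helper names (`random_oversample`, `split`, `leaked_fraction`), and the use of plain random duplication as the over-sampling method are all assumptions made for illustration. It compares over-sampling before the split with over-sampling the training partition only, and measures how many test rows have a copy in the training set.

```python
import random


def random_oversample(rows, rng):
    """Duplicate minority rows (label 1) at random until classes are balanced."""
    minority = [r for r in rows if r[1] == 1]
    majority = [r for r in rows if r[1] == 0]
    if not minority:
        return list(rows)
    # range() of a negative number is empty, so balanced data is left unchanged.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(rows) + extra


def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def leaked_fraction(train, test):
    """Fraction of test rows whose sample id also occurs in the training set."""
    train_ids = {sample_id for sample_id, _ in train}
    return sum(sample_id in train_ids for sample_id, _ in test) / len(test)


rng = random.Random(0)

# Hypothetical toy data: (sample_id, label) pairs, 20 minority rows out of 200.
data = [(i, 1 if i < 20 else 0) for i in range(200)]

# Flawed order: over-sample first, then split. Copies land on both sides.
train_bad, test_bad = split(random_oversample(data, rng), 0.25, rng)

# Sound order: split first, then over-sample the training partition only.
train_raw, test_ok = split(data, 0.25, rng)
train_ok = random_oversample(train_raw, rng)

leak_bad = leaked_fraction(train_bad, test_bad)
leak_ok = leaked_fraction(train_ok, test_ok)
print(f"test rows also present in train, over-sample then split: {leak_bad:.0%}")
print(f"test rows also present in train, split then over-sample: {leak_ok:.0%}")
```

With the sound order the test set shares no samples with the training set, because the split partitions unique rows before any copies are made. With the flawed order, any duplicated minority row can end up on both sides of the split, which is the leak the comments describe.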


Alex Dimakis: Do not apply over-sampling before splitting train and test

1 reply, 15 likes


alex rubinsteyn: Bad ML practices are pervasive in health & biomedical research. Intro stats needs to be scrapped in favor of a class that really convinces people to never touch the test set. Don't use test data for normalization, discretization, oversampling, &c. Lock the test data in a vault.

0 replies, 8 likes


Dillon Niederhut PhD: Evidently some medical researchers have been reporting accuracy on a holdout set... after training on that holdout set 🙊. If your model accuracy is 100%, be just a little bit skeptical.

0 replies, 4 likes


Flávio Clésio: This work shows why methodology matters, especially in sampling strategies.

0 replies, 3 likes


cj battey: Turns out putting exact copies of test samples in your training set makes the test loss look artificially low 🤦‍♂️🤦‍♂️🤦‍♂️.

0 replies, 2 likes


Dr. Greg: Considering that the A1 Top Bird Team can't get the BirdTracker algorithm to work properly, I thought T's idea of creating an AI to break The Curse was overly optimistic at best, but he has already started work #bayareabirds https://arxiv.org/abs/2001.06296

1 reply, 0 likes


Content

Found on Jan 20 2020 at https://arxiv.org/pdf/2001.06296.pdf

PDF content of a computer science paper: Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling