Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling


Jan 20 2020 Jeremy Howard

Interesting paper showing how dozens of studies have accidentally leaked large amounts of data from train->test dataset, by duplicating data items prior to doing a random split.
21 replies, 525 likes

Jan 20 2020 Gilles Vandewiele

Our paper "Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling" has been published on arXiv: What did we do? A thread... (1/6)
13 replies, 426 likes

Jan 20 2020 Gilles Vandewiele

After discovering this, we were able to reproduce their results, but only when making a fundamental methodological flaw: applying over-sampling before partitioning data into training and testing set. (4/6)
4 replies, 37 likes
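
The ordering issue described in the thread can be illustrated with a minimal sketch. It assumes plain random over-sampling (duplicating minority items) and a simple non-stratified random split; the function name `split_then_oversample` and its parameters are hypothetical and are not taken from the paper. The point is only that the test indices are set aside before any duplication happens, so no copy of a test item can end up in the training set.

```python
import random


def split_then_oversample(labels, test_fraction=0.25, seed=0):
    """Split indices into train/test first, then over-sample the training part only.

    Returns (balanced_train_idx, test_idx). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)

    # 1) Partition BEFORE any duplication, so the test set stays untouched.
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # 2) Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # 3) Duplicate random items of the smaller classes until every class
    #    in the training set is as large as the biggest one.
    target = max(len(members) for members in by_class.values())
    balanced_train_idx = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced_train_idx.extend(members + extra)

    rng.shuffle(balanced_train_idx)
    return balanced_train_idx, test_idx


# Toy data: 90 majority items (label 0) and 10 minority items (label 1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = split_then_oversample(labels)
```

Doing the steps in the opposite order (duplicate first, split second) would let copies of the same item fall on both sides of the split, which is the flaw the thread describes.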

Jan 21 2020 Alex Dimakis

Do not apply over-sampling before splitting train and test
1 reply, 15 likes

Jan 21 2020 alex rubinsteyn

Bad ML practices are pervasive in health & biomedical research. Intro stats needs to be scrapped in favor of a class that really convinces people to never touch the test set. Don't use test data for normalization, discretization, oversampling, &c. Lock the test data in a vault.
0 replies, 8 likes
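
The same "fit on training data only" rule applies to normalization. A minimal sketch, assuming simple z-score scaling of a single feature; `fit_scaler` and `apply_scaler` are hypothetical helper names, and the numbers are made up for illustration:

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Estimate mean and standard deviation from the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for a constant feature
    return mu, sigma


def apply_scaler(values, mu, sigma):
    """Scale any values with statistics that were fitted elsewhere."""
    return [(v - mu) / sigma for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Statistics come from the training set; the test set is only transformed.
mu, sigma = fit_scaler(train)
train_scaled = apply_scaler(train, mu, sigma)
test_scaled = apply_scaler(test, mu, sigma)
```

The test values are never passed to `fit_scaler`, so nothing about their distribution can influence the preprocessing.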

Jan 22 2020 Dillon Niederhut PhD

Evidently some medical researchers have been reporting accuracy on a holdout set... after training on that holdout set 🙊. If your model accuracy is 100%, be just a little bit skeptical.
0 replies, 4 likes

Jan 21 2020 Flávio Clésio

This work shows why methodology matters, especially in sampling strategies.
0 replies, 3 likes

Jan 21 2020 cj battey

Turns out putting exact copies of test samples in your training set makes the test loss look artificially low 🤦‍♂️🤦‍♂️🤦‍♂️.
0 replies, 2 likes
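
A quick sanity check for this kind of leak is to look for test rows that also appear verbatim in the training set. The sketch below is an illustration only: `leaked_duplicates` is a hypothetical helper, rows are assumed to be sequences of hashable feature values, and it catches exact copies only. Near-duplicates or interpolated synthetic samples would need a distance-based check instead.

```python
def leaked_duplicates(train_rows, test_rows):
    """Return the test rows that also occur, value for value, in the training rows."""
    train_set = {tuple(row) for row in train_rows}
    return [row for row in test_rows if tuple(row) in train_set]


train_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
test_rows = [[0.3, 0.4], [0.7, 0.8]]

leaks = leaked_duplicates(train_rows, test_rows)
```

An empty result does not prove the split is clean, but a non-empty one is a strong hint that duplication happened before partitioning.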