Jeremy Howard: Interesting paper showing how dozens of studies have accidentally leaked large amounts of data from train->test dataset, by duplicating data items prior to doing a random split.
21 replies, 525 likes
Gilles Vandewiele: Our paper "Overly Optimistic Prediction Results on Imbalanced Data: Flaws and Benefits of Applying Over-sampling" has been published on arXiv: https://arxiv.org/abs/2001.06296
What did we do? A thread...
13 replies, 430 likes
Gilles Vandewiele: After discovering this, we were able to reproduce their results, but only when making a fundamental methodological flaw: applying over-sampling before partitioning data into training and testing sets.
4 replies, 37 likes
Alex Dimakis: Do not apply over-sampling before splitting train and test
1 reply, 15 likes
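To make the ordering issue concrete, here is a minimal sketch in plain Python. It is not the paper's code: the helper names `train_test_split` and `random_oversample`, the toy `(sample_id, label)` rows, and the 30% test fraction are all assumptions made for illustration, and simple random duplication stands in for whatever over-sampling method a study might use.

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def random_oversample(rows, rng):
    """Duplicate rows of smaller classes at random until all classes match the largest."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[1], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

rng = random.Random(0)
# Hypothetical toy data: (sample_id, label), 90 negatives and 10 positives.
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(90, 100)]

# Flawed order: over-sample, then split. Copies of one sample can land on both sides.
leaky_train, leaky_test = train_test_split(random_oversample(data, rng), 0.3, rng)
leaked_ids = {r[0] for r in leaky_train} & {r[0] for r in leaky_test}

# Sound order: split first, then over-sample the training portion only.
train, test = train_test_split(data, 0.3, rng)
train = random_oversample(train, rng)
clean_overlap = {r[0] for r in train} & {r[0] for r in test}

print(len(leaked_ids), "sample ids shared after over-sample-then-split")
print(len(clean_overlap), "sample ids shared after split-then-over-sample")
```

In the sound order the test set keeps its original class balance and never shares a sample with the training set, because duplication happens only after the test rows have been set aside.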
alex rubinsteyn: Bad ML practices are pervasive in health & biomedical research. Intro stats needs to be scrapped in favor of a class that really convinces people to never touch the test set. Don't use test data for normalization, discretization, oversampling, &c. Lock the test data in a vault.
0 replies, 8 likes
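The same rule applies to normalization. As a rough sketch, assuming a single numeric feature and hypothetical helper names `fit_standardizer` and `standardize`, the statistics are computed from the training values only and then reused unchanged on the test values:

```python
import statistics

def fit_standardizer(train_values):
    """Compute mean and standard deviation from training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against zero spread
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on the values passed in."""
    return [(v - mean) / std for v in values]

# Hypothetical toy values for one feature.
train_values = [1.0, 2.0, 3.0, 4.0]
test_values = [10.0, 20.0]

mean, std = fit_standardizer(train_values)          # test data is not consulted
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)   # reuses the training statistics
```

Fitting the statistics on the combined data instead would let information about the test distribution shape the features the model is trained on.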
Dillon Niederhut PhD: Evidently some medical researchers have been reporting accuracy on a holdout set...
...after training on that holdout set 🙊.
If your model accuracy is 100%, be just a little bit skeptical.
0 replies, 4 likes
Flávio Clésio: This work shows why methodology matters, especially in sampling strategies.
0 replies, 3 likes
cj battey: Turns out putting exact copies of test samples in your training set makes the test loss look artificially low 🤦‍♂️🤦‍♂️🤦‍♂️.
0 replies, 2 likes
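One quick audit for this kind of leak is to look for identical feature rows on both sides of an existing split. A minimal sketch, where the helper name `shared_rows` and the toy rows are hypothetical:

```python
def shared_rows(train_rows, test_rows):
    """Return the set of feature tuples that appear in both splits."""
    return {tuple(r) for r in train_rows} & {tuple(r) for r in test_rows}

# Hypothetical toy splits; the second training row also appears in the test set.
train_rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]]
test_rows = [[4.9, 3.0], [5.9, 3.2]]

overlap = shared_rows(train_rows, test_rows)
print(len(overlap), "exact row(s) shared between train and test")
```

A non-empty result is not proof of a flaw, since some datasets contain genuine duplicates, but it is a reason to check how the split was made.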
Found on Jan 20 2020 at https://arxiv.org/pdf/2001.06296.pdf