Papers of the day

Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets

Comments

Patrick Lewis: New! Do you use NaturalQuestions, TriviaQA, or WebQuestions? It turns out 60% of test set answers are also in the train set. More surprising, 30% of test questions have a close paraphrase in the train set. What does it mean for models? Read https://arxiv.org/abs/2008.02637 to find out! 1/ https://t.co/jsW8qa3faL

5 replies, 421 likes
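The overlap statistic the tweet cites can be illustrated with a minimal sketch: count what fraction of test-set answers also appear, after light normalization, among the train-set answers. The function names and the normalization choices below are illustrative assumptions, not the paper's actual methodology.

```python
import string

def normalize(ans: str) -> str:
    """Lowercase and strip punctuation/whitespace for a loose string match.
    (Illustrative normalization; the paper's matching may differ.)"""
    return ans.lower().strip().translate(str.maketrans("", "", string.punctuation))

def answer_overlap(train_answers, test_answers):
    """Fraction of test answers whose normalized form also occurs in the train set."""
    train_set = {normalize(a) for a in train_answers}
    hits = sum(1 for a in test_answers if normalize(a) in train_set)
    return hits / len(test_answers)

# Toy example with made-up answers:
train = ["Paris", "1969", "Albert Einstein"]
test = ["paris", "Marie Curie"]
print(answer_overlap(train, test))  # 0.5 — one of the two test answers overlaps
```

Detecting the 30% of test questions that are close *paraphrases* of train questions is harder than this exact-match check and typically requires similarity scoring rather than set membership.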


Tim Dettmers: Turns out a lot of open-domain QA datasets have test set leakage. If you control for it, model performance drops by a mean absolute of 63%. Yikes! If we missed this for such a long time, I wonder if there are problems with other NLP datasets too. https://arxiv.org/abs/2008.02637

5 replies, 393 likes


Mark Sanderson: Data leakage found between the training and test splits of popular QA datasets. How do QA systems do when the leakage is removed? Much worse. https://arxiv.org/abs/2008.02637

4 replies, 28 likes


Verena Rieser: We found similar problems with train-test overlap when you delexicalise data-to-text datasets for #NLG https://arxiv.org/abs/1911.03905 with @tuetschek @_dmh

0 replies, 20 likes


Barbara Plank: woah 😲! 60% of overlap and 30% close-paraphrases is extreme... from the paper (https://arxiv.org/pdf/2008.02637.pdf): "a greater emphasis should be placed on more behaviour-driven evaluation, rather than pursuing single-number overall accuracy figures." - yes! totally agree #beyondaccuracy

0 replies, 18 likes


Philipp Cimiano: Interesting paper with an analysis of standard QA datasets, showing 30% train/test question overlap. We have been severely overestimating the ability of systems to answer novel questions! https://arxiv.org/pdf/2008.02637.pdf

0 replies, 8 likes


Ted Pedersen: your periodic reminder to be skeptical of #nlproc leaderboards. they may tell us even less than we think about the problems they purport to be reporting on.

0 replies, 6 likes


Jordan Boyd-Graber: Great paper! @Eric_Wallace_, @ihsgnef, @SeeTedTalk, and @ikuyamada had similar intuitions, but didn't test them systematically; instead, they generated a human-in-the-loop QA challenge set that avoids duplicating training data in the test set: http://trickme.qanta.org

0 replies, 5 likes


Elizabeth Merkhofer: @lousylinguist This? https://twitter.com/Tim_Dettmers/status/1291739379887562753?s=20

1 replies, 1 likes


Content

Found on Aug 07 2020 at https://arxiv.org/pdf/2008.02637.pdf

PDF content of a computer science paper: Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets