Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets


Patrick Lewis: New! Do you use NaturalQuestions, TriviaQA, or WebQuestions? It turns out 60% of test set answers are also in the train set. More surprising, 30% of test questions have a close paraphrase in the train set. What does it mean for models? Read to find out! 1/

Tim Dettmers: Turns out a lot of open-domain QA datasets have test set leakage. If you control for it, model performance drops by a mean absolute of 63%. Yikes! If we missed this for such a long time, I wonder if there are problems with other NLP datasets too.

Mark Sanderson: Data leakage found between the training and test splits of popular QA datasets. How do QA systems do when the leakage is removed? Much worse.

Verena Rieser: We found similar problems with train-test overlap when you delexicalise data-to-text datasets for #NLG with @tuetschek @_dmh

Barbara Plank: woah 😲! 60% of overlap and 30% close-paraphrases is extreme... from the paper: "a greater emphasis should be placed on more behaviour-driven evaluation, rather than pursuing single-number overall accuracy figures." - yes! totally agree #beyondaccuracy

Philipp Cimiano: Interesting paper with an analysis of standard QA datasets, showing 30% train/test question overlap. We have been severely overestimating the ability of systems to answer novel questions!

Ted Pedersen: your periodic reminder to be skeptical of #nlproc leaderboards. they may tell us even less than we think about the problems they purport to be reporting on.

Jordan Boyd-Graber: Great paper! @Eric_Wallace_, @ihsgnef, @SeeTedTalk, and @ikuyamada had similar intuitions, but didn't test it out systematically, instead generating a human-in-the-loop QA challenge set that avoids duplicates from the training data in the test set.

Elizabeth Merkhofer: @lousylinguist This?

