
GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Comments

Dec 02 2019 Masato Hagiwara

🎉Introducing GitHub Typo Corpus, a large-scale multilingual dataset of misspellings and grammatical errors. Contains 350k+ edits in 15+ languages. Code & Dataset https://github.com/mhagiwara/github-typo-corpus Paper: https://arxiv.org/abs/1911.12893 joint work w/ @chemical_tree at RIKEN AIP and Tohoku Univ.
5 replies, 738 likes


Dec 02 2019 Marcin Junczys-Dowmunt (Marian NMT)

When Roman Grundkiewicz and I built the first corpus of this kind from Wikipedia edits in 2014 (http://romang.home.amu.edu.pl/wiked/wiked.html) we didn't realize at all that we had created something useful. It went unnoticed for years, but it is now seeing a revival and very nice follow-up work like this:
2 replies, 15 likes


Dec 16 2019 Peter Skomoroch

Projects to Know https://bit.ly/34ss702 curated for @AmplifyPartners this week by @sam_shah. Includes: GitHub Typo Corpus: Large-Scale Multilingual Dataset of Misspellings & Grammatical Errors https://arxiv.org/abs/1911.12893 Microsoft Presidio PII removal: https://github.com/microsoft/presidio
1 reply, 12 likes


Dec 03 2019 Dat Tran

Noice, a new large-scale NLP dataset of misspellings and grammatical errors. Very cool 😎
0 replies, 7 likes


Dec 05 2019 fred

Can I finally get a fucking autocorrect sensitive to keyboard layout and actual key swap probabilities?!
1 reply, 6 likes


Dec 04 2019 Eugene Bagdasaryan

This is such a cool idea: parse GH for commit message: “fix typo”. Elegant way to build a great dataset! Hagiwara and Mita, “GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors” https://arxiv.org/abs/1911.12893 https://t.co/sBTFDqUQvD
1 reply, 5 likes
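
The idea in the comment above, selecting commits whose message says something like "fix typo", can be sketched roughly as follows. This is only an illustration and not the authors' actual pipeline: the keyword pattern and the helper names `is_typo_commit` and `typo_commits` are assumptions, and the commits are assumed to be already available as (sha, message) pairs.

```python
import re

# Assumed keyword pattern; the comment above only mentions "fix typo".
TYPO_MESSAGE = re.compile(r"\b(typos?|spelling|misspell\w*)\b", re.IGNORECASE)


def is_typo_commit(message: str) -> bool:
    """Return True if the first line of a commit message looks like a typo fix."""
    stripped = message.strip()
    if not stripped:
        return False
    first_line = stripped.splitlines()[0]
    return bool(TYPO_MESSAGE.search(first_line))


def typo_commits(commits):
    """Keep the (sha, message) pairs whose message suggests a typo fix."""
    return [(sha, msg) for sha, msg in commits if is_typo_commit(msg)]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("a1", "Fix typo in README"),
        ("b2", "Add streaming parser"),
        ("c3", "Correct misspelled variable name"),
    ]
    for sha, msg in typo_commits(sample):
        print(sha, msg)
```

A real collection step would still need to diff each selected commit and filter out edits that are not actually spelling or grammar corrections, since a matching commit message alone does not guarantee that.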


Dec 04 2019 Alberto Acerbi

Cultural evolutionists, what can we do with this?
0 replies, 5 likes


Dec 02 2019 arXiv CS-CL

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors http://arxiv.org/abs/1911.12893
0 replies, 4 likes


Dec 11 2019 szenyo

commits: “the largest dataset of misspellings to date.”
0 replies, 2 likes


Content