
Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Comments

Jake VanderPlas: The frequency of random seeds between 0 and 1000 on github (data from http://grep.app) https://t.co/Zmp7mwMWil

60 replies, 1520 likes
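The chart linked in that tweet isn't reproduced here, but the underlying count is easy to approximate. A minimal sketch, assuming a local directory of Python files stands in for the grep.app data (the directory path and the regex are illustrative assumptions, not grep.app's API):

```python
# Count how often each literal seed value 0-1000 appears in calls like
# seed(42), manual_seed(42), or random_state=42 across .py files in a
# local corpus. Purely illustrative; "corpus/" is a placeholder path.
import re
from collections import Counter
from pathlib import Path

SEED_PATTERN = re.compile(
    r"(?:seed\(\s*|manual_seed\(\s*|random_state\s*=\s*)(\d{1,4})\b"
)

def count_seeds(root: str) -> Counter:
    counts = Counter()
    for path in Path(root).rglob("*.py"):
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # skip unreadable files
        for match in SEED_PATTERN.finditer(text):
            value = int(match.group(1))
            if 0 <= value <= 1000:
                counts[value] += 1
    return counts

if __name__ == "__main__":
    for seed, n in count_seeds("corpus/").most_common(10):
        print(f"seed {seed}: {n} occurrences")
```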


Jesse Dodge: Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping https://arxiv.org/abs/2002.06305 We found surprisingly large variance just from random seeds when fine-tuning BERT. Both weight inits and the order of the training data have a big impact. 1/n

14 replies, 463 likes
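For reference, a minimal sketch (not the authors' code) of how the two random factors studied in the paper can be controlled independently when fine-tuning BERT: one seed for the randomly initialized classifier head (weight init) and a separate seed for the shuffling order of the training data. The model name and batch size are illustrative choices.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

def build_run(train_dataset, weight_init_seed: int, data_order_seed: int):
    # Weight-init seed: affects the new classification head, the only
    # randomly initialized part on top of the pretrained encoder.
    torch.manual_seed(weight_init_seed)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Data-order seed: a dedicated generator so shuffling is decoupled
    # from the global RNG used above for weight initialization.
    g = torch.Generator()
    g.manual_seed(data_order_seed)
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True, generator=g)
    return model, loader
```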


Daniel Roy: Your SOTA code may only be SOTA for some random seeds. Nonsense or new reality? I suppose there are trivial ways to close the gap using restarts and validation data. https://arxiv.org/abs/2002.06305 https://t.co/mzRGLGH4ZV

23 replies, 311 likes
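The "restarts and validation data" fix mentioned above can be sketched in a few lines: treat each seed as a random restart, fine-tune once per seed, and keep whichever model scores best on the dev set. Here `fine_tune` and `evaluate` are hypothetical callables standing in for a full training and evaluation pipeline.

```python
def best_of_seeds(fine_tune, evaluate, seeds):
    """Run one full fine-tuning per seed and keep the best dev-set model."""
    best_seed, best_score, best_model = None, float("-inf"), None
    for seed in seeds:
        model = fine_tune(seed=seed)          # full fine-tuning run
        score = evaluate(model, split="dev")  # select on validation, never test
        if score > best_score:
            best_seed, best_score, best_model = seed, score, model
    return best_seed, best_score, best_model

# e.g. best_of_seeds(fine_tune, evaluate, seeds=range(10))
```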


Clément Canonne: Controversial opinion: We should stop teaching students any ML methods other than picking a random seed. There is so much inertia against forgetting things that never panned out. It's hard to let go of ideas, especially one's own, but our aim should be to make real progress in AI

5 replies, 258 likes


Thomas Wolf: Happy to see Dodge et al. (http://arxiv.org/abs/2002.06305) settling this question once and for all: the best random seed is 12. A major part of the Deep Learning Research Program can now be considered solved. *rubs his hands together* https://t.co/NrD4xOZAee

11 replies, 183 likes


Marcin Junczys-Dowmunt (Marian NMT): MT people, your BLEU values can vary by 1 point or more based on random seed choice as well. So when you report your results without investigating that, you have no idea what you are actually reporting.

4 replies, 65 likes


Gabriel Ilharco: New paper out! In NLP, fine-tuning large pretrained models like BERT can be a very brittle process. If you're curious about this, this paper is for you! https://arxiv.org/pdf/2002.06305.pdf Work with the amazing @JesseDodge, @royschwartz02, Ali Farhadi, @HannaHajishirzi & @nlpnoah 1/n

1 replies, 65 likes


Charles Sutton: Of course if you call it “tuning the random seed”, it sounds silly. Is it really? Commonly you need to do random restarts in global optimization. That’s what changing the seed is. Why should that bother us?

6 replies, 57 likes


Anna Rogers: Dear people maintaining leaderboards: what do you recommend to do about variance due to model inits? @robinomial @sleepinyourhat @nlpmattg We were about to release some data. Then @JesseDodge @nlpnoah did this: https://arxiv.org/abs/2002.06305 And now it's all darkness and misery.

4 replies, 38 likes


Thomas Wolf: Check out the paper, it's a great read: "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" by @JesseDodge @gabriel_ilharco @royschwartz02 Ali Farhadi, Hannaneh Hajishirzi and @nlpnoah http://arxiv.org/abs/2002.06305

1 replies, 29 likes


Yonatan Belinkov: This is the kind of “common knowledge” that I’ve heard floating around, but not really documented. It’s great to have a detailed study.

1 replies, 20 likes


Leshem Choshen: For anyone outside academia: you probably noticed that BERTs differ by seed. https://arxiv.org/pdf/2002.06305.pdf quantifies by how much. Suggestions: 1. take the best of +-7; 2. try many, stop ones that show no promise early on. @royschwartz02 @nlpnoah @alifarhadi @JesseDodge @gabriel_ilharco

1 replies, 17 likes
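The "try many, stop unpromising ones early" suggestion roughly corresponds to the sketch below: probe every seed with a short training budget, then finish only the best-scoring fraction. `train_for` and `dev_score` are hypothetical helpers, and the budgets and keep fraction are arbitrary illustrative choices.

```python
def early_stop_seed_search(train_for, dev_score, seeds,
                           probe_steps=500, full_steps=5000, keep_frac=0.25):
    # Phase 1: short probe run for every seed.
    probes = {s: train_for(seed=s, steps=probe_steps) for s in seeds}
    ranked = sorted(probes, key=lambda s: dev_score(probes[s]), reverse=True)

    # Phase 2: continue only the most promising runs to the full budget.
    survivors = ranked[: max(1, int(len(ranked) * keep_frac))]
    finished = {s: train_for(seed=s, steps=full_steps, resume=probes[s])
                for s in survivors}

    # Return the seed whose finished run scores best on the dev set.
    return max(finished, key=lambda s: dev_score(finished[s]))
```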


Noah Smith: New work on the roles of random seeds in fine-tuning by @JesseDodge, @gabriel_ilharco, @royschwartz02, Ali Farhadi, @HannaHajishirzi, and @nlpnoah

0 replies, 16 likes


Amirhossein Tebbifakhr: Random seeds impact fine-tuning of BERT. https://arxiv.org/pdf/2002.06305.pdf suggests: fine-tune many, stop non-promising ones early, and continue some. by: @JesseDodge @gabriel_ilharco @royschwartz02 @alifarhadi @nlpnoah cc: @fbk_mt

0 replies, 11 likes


Pasquale Minervini: @zacharylipton @IgorCarron "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" - https://arxiv.org/abs/2002.06305 (although @huggingface's source code reads extremely well, almost like a paper)

0 replies, 9 likes


Jeff Dalton: A bit scary 😱 that random seeds and data order should matter...

1 replies, 5 likes


MT Group at FBK: Our pick of the week: @JesseDodge et al. paper on "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping". By @at_amir https://arxiv.org/pdf/2002.06305.pdf #nlproc #deeplearning #bert @gabriel_ilharco @royschwartz02 @HannaHajishirzi @nlpnoah

0 replies, 5 likes


ML and Data Projects To Know: 📙 Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping Authors: @JesseDodge, @gabriel_ilharco, @royschwartz02, Ali Farhadi, @HannaHajishirzi, @nlpnoah Paper: https://arxiv.org/abs/2002.06305

0 replies, 3 likes


DrHB: Finally, an article about the importance of random seeds :) https://arxiv.org/abs/2002.06305 good read, interesting results :)

0 replies, 2 likes


Djamé: Real question: I'm most certainly missing something, but how come people are surprised by the variance between results linked to different random seeds? In the pre-deep-learning parsing era, this was a given fact. (Petrov, 2010) https://www.aclweb.org/anthology/N10-1003.pdf https://t.co/vGRSIIbW3L

1 replies, 1 likes


La Forge AI: [2002.06305] Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping via @IgnavierN @Minthos_ @jstnclmnt @ceobillionaire https://arxiv.org/abs/2002.06305

0 replies, 1 likes


Content

Found on Apr 08 2020 at https://arxiv.org/pdf/2002.06305.pdf

PDF content of a computer science paper: Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping