
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

Comments

Mark Riedl | BLM: Google trained a 600B transformer neural model. Make it stop https://arxiv.org/abs/2006.16668

44 replies, 924 likes


Dmitry (Dima) Lepikhin: https://arxiv.org/abs/2006.16668 We scaled the Transformer model with Sparsely-Gated Mixture-of-Experts using GShard, and trained a 600B multilingual translation model in about 4 days (for 100 languages) achieving 13.5 BLEU gain compared to the baseline. https://t.co/oOHRK7iiHm

9 replies, 369 likes
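For readers who want a concrete picture of the "Sparsely-Gated Mixture-of-Experts" mentioned in the tweet above, here is a minimal numpy sketch of top-2 expert routing. It is an illustration only, not the GShard implementation: the names and shapes are invented, and the real layer also has expert capacity limits, an auxiliary load-balancing loss, and experts sharded across devices.

import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_experts, d_ff = 8, 16, 4, 32

w_in = rng.normal(size=(num_experts, d_model, d_ff)) * 0.1    # per-expert input projection
w_out = rng.normal(size=(num_experts, d_ff, d_model)) * 0.1   # per-expert output projection
w_gate = rng.normal(size=(d_model, num_experts)) * 0.1        # gating network

def moe_layer(x):
    # One gate logit per expert for every token, turned into a softmax distribution.
    logits = x @ w_gate
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Each token is routed only to its top-2 experts (sparse activation).
    top2 = np.argsort(-probs, axis=-1)[:, :2]
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top2[t]:
            h = np.maximum(x[t] @ w_in[e], 0.0)        # expert feed-forward (ReLU)
            out[t] += probs[t, e] * (h @ w_out[e])     # gate-weighted combination
    return out

x = rng.normal(size=(num_tokens, d_model))
print(moe_layer(x).shape)   # (8, 16)

Because only 2 of the experts run per token, parameter count grows with the number of experts while per-token compute stays roughly constant, which is what makes the 600B figure tractable.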


Stanford NLP Group: In case you haven’t heard, the new unit for measuring computation runtime is TPU core years. But, if you missed that memo, since the numbers are already in the hundreds, you may as well get ahead of the game and start quoting your runtimes in TPU core centuries #NLProc

4 replies, 278 likes
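The arithmetic behind the unit, using the figures quoted in the tweets on this page; it is the dense baselines in the paper, not the MoE run itself, that land in the hundreds of core-years.

# Reading "2048 TPU v3 accelerators" as 2048 cores (an assumption) running for ~4 days:
cores, days = 2048, 4
print(f"{cores * days / 365.25:.1f} TPU core years")   # ~22.4 for the 600B MoE run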


Jeff Dean (@🏡): Great work by @GoogleAI researchers @lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 13.5 BLEU point gain is really significant!

2 replies, 186 likes


Oren Etzioni: Whoever has the most parameters when they die--wins! Now at 600B parameters: https://arxiv.org/pdf/2006.16668.pdf Microsoft, ClosedAI, Facebook--- what've you got?

7 replies, 89 likes


Brandon Rohrer: * successfully trains a model with over TEN THOUSAND parameters, shouts triumphantly, opens twitter *

4 replies, 80 likes


Ana Marasović: If you (like me) see a 600B model, and shriek, let me try to give you some consolation. Why should we care about ultra-large models? 1/n

1 reply, 77 likes


Delip Rao: “In weeks decades happen” 13+ BLEU point improvements in this new work. I have a feeling Google is quietly sitting on a GPT-100 implementation and doesn’t bother telling anyone. https://arxiv.org/abs/2006.16668 https://t.co/qrmqQzVRux

2 replies, 73 likes


Nando de Freitas: Neural networks keep growing in size - this is an amazing time in history!

0 replies, 45 likes


Jeff Dean (@🏡): Training very large, 600B parameter research systems that show even greater translation quality: https://arxiv.org/abs/2006.16668

1 reply, 38 likes


Maxim Raginsky: Kids these days and their fancy GPUs. Frank Rosenblatt, true mensch, would have done this with wires. https://t.co/bdr5kXgEaH

3 replies, 37 likes


Fernando Pereira: Tired: dense; wired (sort of like your noggin): sparse

0 replies, 32 likes


Alexis Conneau: Scaling Transformer models up to 600B parameters using sparsely-gated mixture-of-experts leads to big gains in BLEU score in multilingual machine translation https://arxiv.org/pdf/2006.16668.pdf Impressive work by Google folks https://t.co/k4FZjHjkhh

0 replies, 30 likes


Aran Komatsuzaki: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. GShard enables scaling up the multilingual NMT Transformer with Sparsely-Gated MoE beyond 600 billion parameters using automatic sharding. https://arxiv.org/abs/2006.16668 https://t.co/KiDoJogk3x

3 replies, 13 likes


Shreya Shankar: Dear ML community, read past the abstract! Before you jadedly scream "why train a gajillion parameters on a gajillion TPUs," realize that @lepikhin et al. did so much more. Thread:

1 reply, 12 likes


Ankur Bapna: 600B parameter model, trains in 4 days on over 10 billion examples!! Amazing work on efficiently scaling up model capacities with sparsely-gated MoEs and SPMD by @lepikhin and others at @GoogleAI.

0 replies, 11 likes
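The "SPMD" mentioned above refers to running one program across all devices with the big tensors split between them. A toy numpy illustration of the idea (not GShard's API) is partitioning the per-expert weights along the expert dimension so each device holds only its own experts; in GShard this split is expressed as a lightweight sharding annotation and the compiler's SPMD partitioner generates the per-device program automatically.

import numpy as np

num_experts, d_model, d_ff, num_devices = 8, 16, 32, 4

# Logical per-expert weight tensor: [experts, d_model, d_ff].
w_in = np.random.default_rng(0).normal(size=(num_experts, d_model, d_ff))

# "Shard along the expert axis": each device keeps num_experts // num_devices experts.
shards = np.split(w_in, num_devices, axis=0)
for d, shard in enumerate(shards):
    print(f"device {d}: local expert weights with shape {shard.shape}")   # (2, 16, 32)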


Alexander Kruel: 600 billion parameters: https://arxiv.org/abs/2006.16668 "We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."

1 reply, 10 likes


Loren Lugosch: And they tried training a model with 1 trillion weights, but "encountered several trainability issues with numerical stability, hence did not include the results for the sake of reproducibility".

1 reply, 9 likes


Hacker News: GShard: Scaling giant models with conditional computation and automatic sharding https://arxiv.org/abs/2006.16668

0 replies, 9 likes


Hacker News: 600B parameter Transformer trained by Google https://arxiv.org/abs/2006.16668

0 replies, 8 likes


William Falcon: ugh, i guess it’s time to report results normalized by compute... acc = acc / v100 hours. Reward efficiency gains, not praise brute force...

1 reply, 3 likes
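Taking the suggestion above literally, with TPU core-days standing in for V100 hours (the hardware quoted elsewhere on this page) and the 13.5 BLEU gain standing in for accuracy:

bleu_gain = 13.5                 # quality delta quoted above for the 600B model
tpu_core_days = 2048 * 4         # compute spent, as quoted on this page
print(f"{bleu_gain / tpu_core_days:.5f} BLEU points per TPU core-day")   # ~0.00165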


Mark Douthwaite: I can't tell if I'm impressed or disturbed: @GoogleAI trained a SOTA language model on '2048 TPU v3 accelerators in 4 days'. This is a 600B parameter model. We only crossed the 10B mark a few months ago. Madness.

1 reply, 2 likes


Manuel Araoz: inb4 we see a model with more than 1T parameters in 2020

0 replies, 1 like


Derek Chen: Want to go even bigger than the 175 billion parameters of GPT-3 https://arxiv.org/abs/2005.14165? Then you might be interested in the 600+ bil of GShard for NMT: https://arxiv.org/abs/2006.16668 Now it's a race to one trillion!

0 replies, 1 like


Dominique Beaini: Is it the start of privatized #AI, where everyone will pay a fee for a gigantic pre-trained model? #Google recently trained a gigantic model http://arxiv.org/abs/2006.16668, costing ~$150k of TPU power to train the optimal model, and tens of millions for optimization #Pytorch #Tensorflow

1 reply, 0 likes
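A back-of-the-envelope check on the ~$150k figure above, assuming "2048 TPU v3 accelerators" means 2048 cores (256 TPU v3-8 boards) and roughly $8 per v3-8 hour, the public on-demand Cloud TPU rate around that time; both are assumptions, and the actual internal cost is unknown.

boards = 2048 // 8          # assumed: 2048 cores spread over 256 TPU v3-8 boards
hours = 4 * 24              # ~4 days of training
usd_per_board_hour = 8.0    # assumed public on-demand rate, not an official figure
print(f"~${boards * hours * usd_per_board_hour:,.0f}")   # ~$196,608

Same order of magnitude as the tweet's estimate; preemptible or internal pricing would push it lower.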


Content

Found on Jul 02 2020 at https://arxiv.org/pdf/2006.16668.pdf
