Mark Riedl | BLM: Google trained a 600B transformer neural model.
Make it stop
44 replies, 924 likes
Dmitry (Dima) Lepikhin: https://arxiv.org/abs/2006.16668
We scaled the Transformer model with Sparsely-Gated Mixture-of-Experts using GShard, and trained a 600B multilingual translation model in about 4 days (for 100 languages) achieving 13.5 BLEU gain compared to the baseline. https://t.co/oOHRK7iiHm
9 replies, 369 likes
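The sparsely-gated Mixture-of-Experts idea behind the tweet above can be sketched in a few lines. This is a minimal, hypothetical NumPy illustration of top-2 gating (the expert shapes, normalization details, and load-balancing machinery of the actual GShard model are omitted): only the two highest-scoring experts run per token, so per-token compute stays roughly constant while total parameter count grows with the number of experts.

```python
import numpy as np

def top2_gated_moe(x, gate_w, expert_ws):
    """Minimal top-2 sparsely-gated MoE forward pass (illustrative only).

    x:         (tokens, d_model) activations
    gate_w:    (d_model, num_experts) gating weights
    expert_ws: list of (d_model, d_model) expert weight matrices
    """
    logits = x @ gate_w                              # (tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts
    top2 = np.argsort(probs, axis=-1)[:, -2:]        # top-2 expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top2[t]:                            # only 2 experts run per token
            out[t] += probs[t, e] * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
tokens, d_model, num_experts = 4, 8, 16
x = rng.normal(size=(tokens, d_model))
gate_w = rng.normal(size=(d_model, num_experts))
expert_ws = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]
y = top2_gated_moe(x, d_model and gate_w, expert_ws)
print(y.shape)  # (4, 8)
```

Adding experts here grows parameters 16x over a single dense layer, while each token still touches only 2 of them.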
Stanford NLP Group: In case you haven’t heard, the new unit for measuring computation runtime is TPU core years. But, if you missed that memo, since the numbers are already in the hundreds, you may as well get ahead of the game and start quoting your runtimes in TPU core centuries #NLProc
4 replies, 278 likes
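The "TPU core years" unit above is easy to check against the figures quoted in the abstract (2048 TPU v3 accelerators for 4 days). A back-of-the-envelope sketch, assuming one core per quoted accelerator (the real chip-to-core mapping may differ):

```python
# Convert the quoted training run into "TPU core years".
# Assumption: one core per accelerator as quoted in the abstract.
accelerators = 2048
days = 4
core_years = accelerators * days / 365.25
print(f"{core_years:.1f} TPU core years")  # ~22.4
```

So the runtime is indeed in the tens of core-years, consistent with the joke.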
Jeff Dean (@🏡): Great work by @GoogleAI researchers @lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen.
13.5 BLEU point gain is really significant!
2 replies, 186 likes
Oren Etzioni: Whoever has the most parameters when they die--wins! Now at 600B parameters: https://arxiv.org/pdf/2006.16668.pdf Microsoft, ClosedAI, Facebook--- what've you got?
7 replies, 89 likes
Brandon Rohrer: * successfully trains a model with over TEN THOUSAND parameters, shouts triumphantly, opens twitter *
4 replies, 80 likes
Ana Marasović: If you (like me) see a 600B model, and shriek, let me try to give you some consolation. Why should we care about ultra-large models?
1 replies, 77 likes
Delip Rao: “In weeks decades happen”
13+ BLEU point improvements in this new work. I have a feeling Google is quietly sitting on a GPT-100 implementation and doesn’t bother telling anyone. https://arxiv.org/abs/2006.16668 https://t.co/qrmqQzVRux
2 replies, 73 likes
Nando de Freitas: Neural networks keep growing in size - this is an amazing time in history!
0 replies, 45 likes
Jeff Dean (@🏡): Training very large, 600B parameter research systems that show even greater translation quality:
1 replies, 38 likes
Maxim Raginsky: Kids these days and their fancy GPUs. Frank Rosenblatt, true mensch, would have done this with wires. https://t.co/bdr5kXgEaH
3 replies, 37 likes
Fernando Pereira: Tired: dense; wired (sort of like your noggin): sparse
0 replies, 32 likes
Alexis Conneau: Scaling Transformer models up to 600B parameters using sparsely-gated mixture-of-experts leads to big gains in BLEU score in multilingual machine translation
Impressive work by Google folks https://t.co/k4FZjHjkhh
0 replies, 30 likes
Aran Komatsuzaki: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard enables scaling up multilingual NMT Transformers with Sparsely-Gated MoE beyond 600 billion parameters using automatic sharding.
3 replies, 13 likes
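The "automatic sharding" half of the title refers to partitioning the giant model across many accelerators. A toy sketch of the placement idea (hypothetical, not GShard's actual SPMD annotation API): spread the experts evenly over devices, so total parameters grow with device count while per-device memory stays flat.

```python
def shard_experts(num_experts, num_devices):
    """Round-robin placement of expert ids onto device ids (toy sketch)."""
    placement = {d: [] for d in range(num_devices)}
    for e in range(num_experts):
        placement[e % num_devices].append(e)
    return placement

# 2048 experts over 512 devices: each device holds exactly 4 experts.
placement = shard_experts(num_experts=2048, num_devices=512)
per_device = {len(v) for v in placement.values()}
print(per_device)  # {4}
```

In the real system the compiler handles this partitioning from lightweight sharding annotations, so the model code reads as if it ran on one giant device.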
Shreya Shankar: Dear ML community, read past the abstract! Before you jadedly scream "why train a gajillion parameters on a gajillion TPUs," realize that @lepikhin et al. did so much more. Thread:
1 replies, 12 likes
Ankur Bapna: 600B parameter model, trains in 4 days on over 10 billion examples!!
Amazing work on efficiently scaling up model capacities with sparsely-gated MoEs and SPMD by @lepikhin and others at @GoogleAI.
0 replies, 11 likes
Alexander Kruel: 600 billion parameters: https://arxiv.org/abs/2006.16668
"We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."
1 replies, 10 likes
Loren Lugosch: And they tried training a model with 1 trillion weights, but "encountered several trainability issues with numerical stability, hence did not include the results for the sake of reproducibility".
1 replies, 9 likes
Hacker News: GShard: Scaling giant models with conditional computation and automatic sharding https://arxiv.org/abs/2006.16668
0 replies, 9 likes
Hacker News: 600B parameter Transformer trained by Google https://arxiv.org/abs/2006.16668
0 replies, 8 likes
William Falcon: ugh, i guess it’s time to report results normalized by compute...
acc = acc/v100 hours
reward efficiency gains, not praise brute force...
1 replies, 3 likes
Mark Douthwaite: I can't tell if I'm impressed or disturbed: @GoogleAI trained a SOTA language model on '2048 TPU v3 accelerators in 4 days'. This is a 600B parameter model. We only crossed the 10B mark a few months ago. Madness.
1 replies, 2 likes
Manuel Araoz: inb4 we see a model with more than 1T parameters in 2020
0 replies, 1 likes
Derek Chen: Want to go even bigger than the 175 billion parameters of GPT-3 https://arxiv.org/abs/2005.14165? Then you might be interested in the 600+ bil of GShard for NMT: https://arxiv.org/abs/2006.16668 Now it's a race to one trillion!
0 replies, 1 likes
Dominique Beaini: Is it the start of privatized #AI, where everyone will pay a fee for a gigantic pre-trained model?
#Google recently trained a gigantic model http://arxiv.org/abs/2006.16668, costing ~$150k of TPU compute to train the optimal model, and tens of millions for optimization
1 replies, 0 likes
Found on Jul 02 2020 at https://arxiv.org/pdf/2006.16668.pdf