
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding


Mark Riedl | BLM: Google trained a 600B transformer neural model. Make it stop

44 replies, 924 likes

Dmitry (Dima) Lepikhin: We scaled the Transformer model with Sparsely-Gated Mixture-of-Experts using GShard, and trained a 600B multilingual translation model in about 4 days (for 100 languages) achieving 13.5 BLEU gain compared to the baseline.

9 replies, 369 likes
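The "Sparsely-Gated Mixture-of-Experts" mentioned above is the key to scaling: each token is routed to only a few experts, so parameter count grows without a matching growth in per-token compute. A minimal numpy sketch of top-2 gating, with illustrative names and single-matmul "experts" (the real experts are two-layer feed-forward networks, and GShard's router also adds capacity limits and auxiliary load-balancing losses not shown here):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, gate_w, expert_ws, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:    (n_tokens, d_model)
    gate_w:    (d_model, n_experts) gating weights
    expert_ws: list of (d_model, d_model) per-expert weights
    """
    gates = softmax(tokens @ gate_w)             # (n_tokens, n_experts)
    topk = np.argsort(-gates, axis=1)[:, :k]     # top-k expert indices per token
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        chosen = topk[t]
        weights = gates[t, chosen]
        weights = weights / weights.sum()        # renormalize over chosen experts
        for e, w in zip(chosen, weights):
            out[t] += w * (tokens[t] @ expert_ws[e])  # only k experts run per token
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
tokens = rng.standard_normal((5, d))
gate_w = rng.standard_normal((d, n_experts))
expert_ws = [rng.standard_normal((d, d)) for _ in range(n_experts)]
y = moe_layer(tokens, gate_w, expert_ws)
print(y.shape)  # (5, 8)
```

With k fixed at 2, doubling the number of experts doubles the parameters while each token still touches only two of them — which is how the paper reaches 600B parameters at a tractable training cost.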

Stanford NLP Group: In case you haven’t heard, the new unit for measuring computation runtime is TPU core years. But, if you missed that memo, since the numbers are already in the hundreds, you may as well get ahead of the game and start quoting your runtimes in TPU core centuries #NLProc

4 replies, 278 likes

Jeff Dean (@🏡): Great work by @GoogleAI researchers @lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 13.5 BLEU point gain is really significant!

2 replies, 186 likes

Oren Etzioni: Whoever has the most parameters when they die--wins! Now at 600B parameters: Microsoft, ClosedAI, Facebook--- what've you got?

7 replies, 89 likes

Brandon Rohrer: * successfully trains a model with over TEN THOUSAND parameters, shouts triumphantly, opens twitter *

4 replies, 80 likes

Ana Marasović: If you (like me) see a 600B model, and shriek, let me try to give you some consolation. Why should we care about ultra-large models? 1/n

1 reply, 77 likes

Delip Rao: “In weeks decades happen.” 13+ BLEU point improvements in this new work. I have a feeling Google is quietly sitting on a GPT-100 implementation and doesn’t bother telling anyone.

2 replies, 73 likes

Nando de Freitas: Neural networks keep growing in size - this is an amazing time in history!

0 replies, 45 likes

Jeff Dean (@🏡): Training very large, 600B parameter research systems that show even greater translation quality:

1 reply, 38 likes

Maxim Raginsky: Kids these days and their fancy GPUs. Frank Rosenblatt, true mensch, would have done this with wires.

3 replies, 37 likes

Fernando Pereira: Tired: dense; wired (sort of like your noggin): sparse

0 replies, 32 likes

Alexis Conneau: Scaling Transformer models up to 600B parameters using sparsely-gated mixture-of-experts leads to big gains in BLEU score in multilingual machine translation. Impressive work by Google folks

0 replies, 30 likes

Aran Komatsuzaki: GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. GShard enables scaling up the multilingual NMT Transformer with Sparsely-Gated MoE beyond 600 billion parameters using automatic sharding.

3 replies, 13 likes
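The "automatic sharding" side of GShard lets the programmer annotate which tensor dimensions to partition, and the compiler (XLA SPMD) handles the per-device computation and communication. A toy numpy sketch of the underlying idea, simulating devices with array slices (illustrative only, not the GShard/XLA implementation):

```python
import numpy as np

def sharded_matmul(x, w, num_shards):
    """SPMD-style sketch: partition w column-wise across `num_shards`
    simulated devices, run the same local matmul on each shard, then
    concatenate the results -- mathematically identical to x @ w, but
    no single device ever holds all of w."""
    w_shards = np.split(w, num_shards, axis=1)  # each "device" holds one slice of w
    partials = [x @ ws for ws in w_shards]      # identical local program on each shard
    return np.concatenate(partials, axis=1)     # stand-in for an all-gather

rng = np.random.default_rng(1)
x = rng.standard_normal((3, 8))
w = rng.standard_normal((8, 16))
out = sharded_matmul(x, w, num_shards=4)
assert np.allclose(out, x @ w)  # sharded result matches the unsharded matmul
```

This is what makes the 600B model feasible: expert weights are split across the 2048 TPU cores, and each core runs the same single program on its shard.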

Shreya Shankar: Dear ML community, read past the abstract! Before you jadedly scream "why train a gajillion parameters on a gajillion TPUs," realize that @lepikhin et al. did so much more. Thread:

1 reply, 12 likes

Ankur Bapna: 600B parameter model, trains in 4 days on over 10 billion examples!! Amazing work on efficiently scaling up model capacities with sparsely-gated MoEs and SPMD by @lepikhin and others at @GoogleAI.

0 replies, 11 likes

Alexander Kruel: 600 billion parameters: "We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art."

1 reply, 10 likes

Loren Lugosch: And they tried training a model with 1 trillion weights, but "encountered several trainability issues with numerical stability, hence did not include the results for the sake of reproducibility".

1 reply, 9 likes

Hacker News: GShard: Scaling giant models with conditional computation and automatic sharding

0 replies, 9 likes

Hacker News: 600B parameter Transformer trained by Google

0 replies, 8 likes

William Falcon: ugh, i guess it’s time to report results normalized by compute... acc = acc / v100-hours. reward efficiency gains, don't praise brute force...

1 reply, 3 likes

Mark Douthwaite: I can't tell if I'm impressed or disturbed: @GoogleAI trained a SOTA language model on '2048 TPU v3 accelerators in 4 days'. This is a 600B parameter model. We only crossed the 10B mark a few months ago. Madness.

1 reply, 2 likes

Manuel Araoz: inb4 we see a model with more than 1T parameters in 2020

0 replies, 1 like

Derek Chen: Want to go even bigger than the 175 billion parameters of GPT-3? Then you might be interested in the 600+ billion of GShard for NMT. Now it's a race to one trillion!

0 replies, 1 like

Dominique Beaini: Is it the start of privatized #AI, where everyone will pay a fee for a gigantic pre-trained model? #Google recently trained a gigantic model, costing ~$150k of TPU power to train the optimal model, and tens of millions for optimization #Pytorch #Tensorflow

1 reply, 0 likes


Found on Jul 02 2020
