Papers of the day

Language Models are Few-Shot Learners


𝔊𝔴𝔢𝔯𝔫: GPT-3 is terrifying because it's a tiny model compared to what's possible, trained in the dumbest way possible on a single impoverished modality on tiny data, yet the first version already manifests crazy runtime meta-learning—and the scaling curves 𝘴𝘵𝘪𝘭𝘭 are not bending! 😮

28 replies, 968 likes

Michael Nielsen: Spent an enjoyable few hours digging into GPT-3, trying to better understand how it works, what the limits are, how it may be improved. The paper is here:

14 replies, 905 likes

hardmaru: GPT-3: Language Models are Few-Shot Learners, by @notTomBrown et al. “We train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.”

13 replies, 547 likes

Ben Mann: We just published our paper on GPT-3! Proud to be part of this awesome team!

6 replies, 498 likes

Tom Brown: Language models are few shot learners! We find that larger models can often (but not always) perform NLP tasks given only a natural language prompt and a few examples in the context. No fine-tuning. Paper: Illustrated summary ⬇️ (1/12)

15 replies, 468 likes

Natasha Jaques: GPT-3 is conjugating words that don't exist

12 replies, 394 likes

Mitchell Gordon: Papers like these make me feel like we're all telegraph engineers in the pre-Shannon era. Back in the day, if you had trouble getting the signal through, you just bumped up the amplitude. It kind of helped. (1/2)

2 replies, 329 likes

Nando de Freitas: This brilliant ⁦@OpenAI⁩ work and the video of ⁦@karpathy⁩ I shared recently are very exciting AI frontiers. The story repeats itself: Big net, curated data, and common sense are the ingredients. Congrats ⁦@ilyasut⁩ et al.

3 replies, 266 likes

Mark Riedl wears pants during video calls: GPT-3 has 175 billion parameters, trained on 300 billion tokens

13 replies, 251 likes

Alfredo Canziani: «GPT-3» is out! 🤓 With 175 billion parameters and 4 bytes per parameter / gradient it takes *only* 1.4 TB on your GPU 🤔 For comparison, a cat 🐱 cortex 🧠 has only 20× more synapses.

6 replies, 219 likes
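The arithmetic in the tweet above can be checked in a few lines. A rough sketch only: the 4-bytes-per-value (fp32) assumption is the tweet's, and real training runs also keep optimizer state beyond parameters and gradients.

```python
# Back-of-the-envelope memory footprint for a 175B-parameter model,
# counting fp32 copies of the parameters (and optionally gradients).
PARAMS = 175e9
BYTES_PER_VALUE = 4  # fp32, as assumed in the tweet

def footprint_tb(params: float, copies: int) -> float:
    """Terabytes needed for `copies` fp32 tensors of `params` values each."""
    return params * BYTES_PER_VALUE * copies / 1e12

weights_only = footprint_tb(PARAMS, 1)  # parameters alone: 0.7 TB
with_grads = footprint_tb(PARAMS, 2)    # parameters + gradients: 1.4 TB

print(f"weights: {weights_only:.2f} TB, weights+grads: {with_grads:.2f} TB")
```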

Rogue P. Bigham: i think i'm going to wait until GPT-4 to upgrade. seems like a mid-cycle release. trillion parameters or bust.

4 replies, 186 likes

Oriol Vinyals: Scale *still* delivers! Congrats @OpenAI on showing very nice zero/few-shot language capabilities of GPT-3. #timelesstweet Paper: Endless Samples:

1 replies, 167 likes

NLP for Development: "In collecting training data for GPT-3, we made no effort to select either for or against foreign languages" Meaning: At @OpenAI we make no effort with language representation and show our indifference by using pejoratives like "foreign languages"

4 replies, 125 likes

Sebastian Gehrmann from far away: The ELMo paper? 15 pages. BERT? 16 pages. GPT-2? 24 pages. T5? 53 pages. GPT-3?? 72 pages! Showing once and for all that paper sizes keep growing. We really should be concerned about the energy implications, poor trees :(

5 replies, 125 likes

Jonathan Fly 👾: GPT-3: Language Models are Few-Shot Learners The new 175 Billion Parameter GPT-3 excels at a battery of NLP benchmarks (translation, question-answering, etc) with prompting alone -- no fine-tuning. Awaiting more samples! abs: pdf:

9 replies, 121 likes

Leon Derczynski: If GPT3 took 50 petaflop/s-days to train, w. GPUs at 10^8 FLOPs per joule, those 4.32E21 FLOPs used 12 GWh to train? E.g. 12 hours of a whole nuclear reactor? At 0.73kg per kWh that's.. 8.8 kilotonnes of CO2?! #sanitycheck #nlproc

9 replies, 113 likes
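The sanity check above can be replayed step by step. All inputs here are the tweet's own assumptions (50 petaflop/s-days of compute, 1e8 FLOPs per joule of GPU efficiency, 0.73 kg CO2 per kWh), not official figures from the paper.

```python
# Energy and CO2 estimate under the tweet's stated assumptions.
PFS_DAYS = 50            # petaflop/s-days of training compute (tweet's figure)
FLOPS_PER_JOULE = 1e8    # assumed GPU efficiency
KG_CO2_PER_KWH = 0.73    # assumed grid carbon intensity

total_flops = PFS_DAYS * 1e15 * 86_400           # 4.32e21 FLOPs
energy_joules = total_flops / FLOPS_PER_JOULE    # 4.32e13 J
energy_kwh = energy_joules / 3.6e6               # ~12 million kWh = 12 GWh
co2_tonnes = energy_kwh * KG_CO2_PER_KWH / 1000  # ~8,760 tonnes of CO2

print(f"{energy_kwh / 1e6:.1f} GWh, {co2_tonnes:,.0f} t CO2")
```

Note the final unit: 8.76 million kg is on the order of kilotonnes, not megatonnes.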

Aran Komatsuzaki: Language Models are Few-Shot Learners - GPT-3 (175B params) causal LM - matches sota fine-tuned performance with few-shot learning on various tasks - can write indistinguishable news articles

2 replies, 110 likes

Amanda Askell: I recently worked on human evaluations of GPT-3 with @girishsastry. We found that people’s ability to distinguish model generated news articles from human written news articles approaches chance as model size increases.

3 replies, 105 likes

Two Minute Papers 📜: OpenAI GPT-3 - Good At Almost Everything! 🤖 ▶️Full video (ours): 📜Source paper: ❗Source tweet: #ai #deeplearning #science #twominutepapers #neuralnetworks #machinelearning #gpt2 #gpt3 #gpt-3 #openai

3 replies, 104 likes

roadrunner01: GPT-3 is here 😮

3 replies, 103 likes

Robert (Munro) Monarch: Hey @OpenAI folk. I spent many hours working with you on GPT-2 to make sure you were #benderrule compliant and talked about language representation appropriately. You seem to have forgotten everything I taught you. Also, the internet is not "a natural distribution of languages"

2 replies, 87 likes

roadrunner01: Language Models are Few-Shot Learners pdf: abs: github:

2 replies, 81 likes

Graham Neubig: Large model/hardware trivia: Google's new TPU supercomputer ( could potentially train GPT-3 ( in about 7.5 days. Actually a bit longer than I expected. (GPT-3 175B model requires 3.14E+23 flops, Google cluster does 480PFLOPs/s)

1 replies, 80 likes
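The estimate above is simple division; a sketch under the tweet's own figures (the quoted 3.14E+23 FLOPs of training compute, a 480 PFLOP/s cluster, and the optimistic assumption of perfect sustained utilization):

```python
# Hypothetical training time: total training FLOPs / cluster throughput.
TOTAL_FLOPS = 3.14e23    # quoted training compute for GPT-3 175B
CLUSTER_FLOPS = 480e15   # 480 PFLOP/s TPU supercomputer, as in the tweet

seconds = TOTAL_FLOPS / CLUSTER_FLOPS
days = seconds / 86_400

print(f"{days:.1f} days")  # ~7.6 days, matching the tweet's ~7.5
```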

Gavin Baker: 1) GPT-3 and the higher semiconductor intensity of AI: This graph of the compute used to train different AI models looks like it is growing exponentially, but it is already scaled *logarithmically*

2 replies, 76 likes

Gautam Kamath: Timely, given the discussion the other day about author order (@neu_rips). @OpenAI puts out a 31 author paper on GPT-3 ( 1. Choosing author order in a group this large is something I want no part of; 2. They include a list of what every person contributed

8 replies, 63 likes

Richard Socher: Great new paper by @OpenAI on a massive Transformer Language Model for Controllable Generation and Multitask Learning There are 3 equivalent super tasks of NLP: Language models, dialogue systems and question answering. LMs have the most training data->win.

0 replies, 53 likes

Kirk Borne: The amazing @OpenAI GPT-3 #AI text-generation API has been in the news a lot lately: 1) 2) 3) 4) Research Paper: #BigData #DataScience #MachineLearning #NLG #AGI

1 replies, 45 likes

Tom Brown: I encourage y’all to read (or at least skim) the paper. I’m really proud to have had a part in creating this work over the last 18 months and am glad to get to share it with you. Paper: Samples & Data: (12/12)

2 replies, 44 likes

Sam Bowman: So, GPT-3 is out. From a first glance: The news generation and LAMBADA results are *really* impressive. I'm also a little disappointed not to see any fine-tuning experiments. Labeled data is pretty cheap! How much better would we do if we used it?

6 replies, 43 likes

John Shedletsky: Amazing AI-generated article from the GPT-3 paper ( #IAmAShapeshifter #YouCouldHaveWornTheTux

5 replies, 41 likes

Grady Booch: Deep fakes at scale. But with text, not images or videos.

3 replies, 30 likes

shanley: In bad news for the internet, "we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans."

2 replies, 29 likes

Sam McCandlish: Proud to be a part of this exciting project led by Dario: We applied our scaling laws to train a highly adaptable model that can do Q&A, translation, and even poetry generation – all without any fine-tuning!

0 replies, 28 likes

Jon, from Videogames: The GPT-3 paper is out.

1 replies, 28 likes

Sushant Kumar: 4/n Also, GPT-3 is stochastic. So, that would mean every time it's given a word, it can come up with a different tweet. The stochasticity can be varied using the temperature parameter between 0 and 1. More on that in the official paper here:

3 replies, 23 likes
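The temperature knob mentioned here can be sketched in a few lines: dividing the logits by a temperature before the softmax sharpens (T < 1) or flattens (T > 1) the sampling distribution. This is a generic illustration, not OpenAI's implementation, and the logits are made up.

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from logits rescaled by temperature.
    Low temperature -> near-greedy; high temperature -> more varied."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# With a very low temperature the highest logit wins almost every time.
logits = [2.0, 1.0, 0.1]
picks = [sample_with_temperature(logits, temperature=0.1) for _ in range(100)]
print(picks.count(0))
```

(Temperature exactly 0 would divide by zero; APIs typically treat T→0 as greedy decoding.)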

Xander Steenbrugge: 175 𝘽𝙞𝙡𝙡𝙞𝙤𝙣 parameters.. really? Look, I'm all down for using overparameterized neural nets to solve hard tasks, but this is starting to get very impractical to run.. (maybe that's the point.. 🤔) Someone please tame this beast by pruning it down to a usable size 😅

3 replies, 18 likes

Aza Raskin: OpenAI just released GPT-3, which "can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans". This is not going to just end poorly, but begin and middle poorly.

2 replies, 17 likes

Nick Diakopoulos: "Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans." -- It's a 10x larger model than GPT-2

0 replies, 17 likes

Sriram Krishnan: Great thread on GPT-3 strengths and weaknesses ( in case you haven’t seen any GPT-3 related tweets in your timeline already 😏)

0 replies, 17 likes

Apoorv Nandan: Turns out a model trained to predict the next word on billions of sentences learns to respond to instructions. E.g. input: translate english to french, cheese toast output: fromage au toast Zero shot. No fine tuning needed. 🤯 #gpt3

1 replies, 14 likes
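The interaction pattern described above is just string assembly: the "training" is a handful of demonstrations placed in the prompt, and the model continues the pattern. The helper name and `=>` separator below are illustrative choices, not the paper's exact format (the "sea otter" demonstration does appear in the paper).

```python
# A sketch of constructing a few-shot, in-context-learning prompt.
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, demonstration pairs, and an open query."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # the model is asked to complete this line
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "cheese toast",
)
print(prompt)
```

No gradient updates happen anywhere; the same frozen model handles any task you can phrase this way.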

swapp 🥭: How to detect a fake bot, ask it to define a made up word and see if it is successful in defining it without any hesitation

2 replies, 14 likes

Jack Hessel: Gargantuan effort from OpenAI --- really cool findings re: what scale can bring! + an unforeseen solution for LM release ethics: It can't be used for bad if no one can load it into memory (GPT-3 weights are ~350GB assuming half-precision floats) ;)

1 replies, 14 likes

Sam Finlayson: Has anyone run the numbers yet on the financial and carbon cost of training this big kahuna?

5 replies, 13 likes

ralph waldo cybersyn: using the world's most advanced computer systems and algorithms, top scientists have devised a way to remove borat voice from any english sentence

0 replies, 13 likes

plotly: Unlike examples that involve HTML/JSX, it is unlikely that GPT-3 was pre-trained on many annotated PX code samples. For this reason, it's really interesting to see its few-shot learning capabilities in action, which is a substantial finding from the paper:

2 replies, 13 likes

AI 212: OpenAI GPT-3 with 175 billion parameters . Language Models are Few-Shot Learners #GPT3 #Tensorflow #NLU #Pytorch #Python #AI #NLP #OpenAI

0 replies, 12 likes

no love deep learning: #gpt3 also has 30 authors, which implies that each author was responsible to collect ~10 billion tokens and personally train 5.84 billion parameters

1 replies, 10 likes

Brian Roemmele: On October 7th, 2005 I began using protocols in #TheIntelligenceAmplifier that are now captivating the Silicon Valley and VC world. Pre-trained language representations for NLP systems, called generative pre-training, now GPT-3. You will hear a lot about it.

2 replies, 10 likes

brain mentality: this is kinda fucked

4 replies, 10 likes

Christian Wolf: 175 Billion parameters, academia can't compete anymore with these insane requirements of compute... Also, 50 petaflop/s-days is a strange unit. => 24*60*60*50 = 4320000 petaflop => 4320 exaflop => 4.3 zettaflop #GPT3

0 replies, 9 likes

Ste𝔣an 🖥️🎧⚡: GPT-3 😱 "Language Models are Few-Shot Learners"

1 replies, 9 likes

arXiv CS-CL: Language Models are Few-Shot Learners

0 replies, 9 likes

Sushant Kumar: @mnpinto_ @OpenAI @gdb It definitely was trained on crawl data of the web and books. So, it's quite possible that this could have come from the training data. Good find.

0 replies, 8 likes

Daniel Hoadley ⚫️: GPT-3 is impressive. Extraordinarily impressive in fact. But hyperbolic tweets like this really irritate me. And nowhere in this thread do I see mention of the original paper ( or even more specifically section 6 of the paper

3 replies, 8 likes

Hacker News: “GPT-3: Language Models Are Few-Shot Learners”, Brown et al. 2020 (OpenAI)

0 replies, 8 likes

Edward Dixon: A behemoth of a model from a behemoth of a team. @OpenAI 's GPT-3: "For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model." @CShorten30 @seb_ruder, #NLP. Wow!

0 replies, 8 likes

Jonathan Oppenheim: Great thread on GPT-3 by @michael_nielsen without the hype.

0 replies, 7 likes

Dr. Eli David: AI models are becoming larger at a staggering rate, making computational requirements a huge bottleneck for real-world deployment. Our brain has 1000x more weights than GPT-3, but a power consumption of under 25 watts at peak performance, i.e., a small fraction of a single GPU.

1 replies, 7 likes

Jacob Buckman: Something I really like about this work is its implications for RL in POMDPs. This is evidence that we will get a lot of complex behaviors "for free" by just using a giant model that encodes the history.

2 replies, 7 likes

Rishabh @ Home 🎉: Damn GPT-3 just came out 😱😱😱

0 replies, 5 likes

Alejandro Piad-Morffis: Stay curious 🖖: - 📃 <> - 🗞️ <> - 💻 <> - 🎥 <> - 🎥 <>

0 replies, 5 likes

Prof. Anima Anandkumar: @CliffRayman @OpenAI @Microsoft @Twitter The GPT-3 paper itself admits to #AI #bias but does not recommend any mitigation strategies

2 replies, 5 likes

Natesh Ganesh: Given these numbers, all this talk of AI & ML democratization sounds sillier with every new bigger model.

0 replies, 5 likes

Daniel Roy: Quite an extensive Broader Impact statement there. Haven't read it closely, but curious to hear what people think.

1 replies, 5 likes

Marco De Nadai: Deep models vs CO2

0 replies, 5 likes

Richard Minerich: It might be hard to overstate what a big deal this GPT-3 result is, few shot learning changes everything. Being a "data company" is suddenly much less of a moat in many cases. This might be the beginning of a huge explosion in NLP.

3 replies, 5 likes

Mark Sanderson: Section 6 of this GPT-3 paper discusses potential language model misuse, how gender, race, and religion are represented in the model, as well as the energy used to train it. Thanks to @hannahbast for the pointer to this welcome addition.

0 replies, 4 likes

Pujaa Rajan | Black Lives Matter: 🤯 Technical Takeaways Zero-shot performance improves steadily with model size. Few-shot performance increases more rapidly. Larger models are better at in-context learning. Graph from paper: (9/13)

1 replies, 4 likes

Convaise: With #GPT3, a few-shot learner and one of the largest language models ever trained, @OpenAI sets new standards in multiple #NLP tasks. We're excited to see how such extensive models can be used efficiently in production!

0 replies, 4 likes

Katelyn Gadd: Machine Learning is truly a nightmare (from a GPT-3 paper,

0 replies, 4 likes

DelocalizedDanny: #MachineLearning sanitycheck...time to improve AI such that we can do the same with less data? #smartAI

0 replies, 3 likes

Hani 🧢: GPT-3 is a new gigantic language model from @openai and it will blow your mind. Just a few examples written in plain English is enough for the model to learn a new task, without any special training for it first! (Model input grey, model output black)

0 replies, 3 likes

미키베어: GPT-3: Language Models are Few-Shot Learners "... we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model..."

1 replies, 3 likes

Bill Grosso: The GPT-3 paper is astonishingly readable.

0 replies, 3 likes

Peter Burns: @genuine_doubt Oh, the paper answers the first: > GPT-3 175B [can generate] 100 pages of content from a trained model can cost on the order of 0.4 kW-hr So ~2,000 pages of output per dollar Divide by 10 for capital, profit margin, etc, and ~200 pages per dollar

0 replies, 3 likes
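The reply above can be made explicit. The 0.4 kWh per 100 pages figure is quoted from the paper; the electricity price and the 10× overhead divisor are the tweet's rough assumptions, filled in here with an illustrative price consistent with its ~2,000 pages/$ figure.

```python
# Rough inference-cost arithmetic: pages of generated text per dollar.
KWH_PER_100_PAGES = 0.4  # quoted from the paper for GPT-3 175B
USD_PER_KWH = 0.125      # assumed electricity price (illustrative)
OVERHEAD = 10            # assumed factor for capital cost, margin, etc.

pages_per_dollar_energy = 100 / KWH_PER_100_PAGES / USD_PER_KWH  # ~2,000
pages_per_dollar_all_in = pages_per_dollar_energy / OVERHEAD     # ~200

print(f"~{pages_per_dollar_energy:.0f} pages/$ (energy only), "
      f"~{pages_per_dollar_all_in:.0f} pages/$ all-in")
```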

Merzmensch Kosmopol: @MadBMan @OpenAI They don't tell the exact sources, but it's a huge amount of data from the Internet, 2016-2019 (paper: In the end there are 570GB of text. Testing GPT-3 for knowledge, we can see there is almost everything. It can even write letters in 18th-century Russian.

0 replies, 3 likes

Alfredo Canziani: Full summary from first author @nottombrown follows.

1 replies, 2 likes

Moiz Saifee: #DeepLearning models keep on getting bigger and better but 175B parameters is crazy even by Deep Learning's standards #NLP #DataScience

0 replies, 2 likes

Rodrigo Agerri: Every language is foreign to English, and the Internet is a messy natural distribution of languages. Wow

1 replies, 2 likes

QC: in these troubled times please enjoy some screenshots of GPT-3 poetry; it was asked to write a poem called Shadows on the Way in the style of Wallace Stevens

1 replies, 2 likes

Adi Fuchs: Bitcoin 2018: Our computation costs more than Austria’s electricity bill! NLP 2020: hold my beer. #gpt3

0 replies, 2 likes

Daniel Hoadley ⚫️: @ines_curt @alexgsmith @mengwong @StewieKee @DohertyLawTeach @lawheroez @jbrowder1 @scarlettyard @sally_iaccm @tcummins @Akoneira If you’re interested in this, I’d really recommend taking a look at the GPT-3 paper.

0 replies, 2 likes

StructuredStories: OpenAI just published a 72-page paper on GPT-3 - a 175 billion parameter language model. "for news articles that are around 500 words long, GPT-3 continues to produce articles that humans find difficult to distinguish from human written news articles"

1 replies, 2 likes

Derek Chen: Want to go even bigger than the 175 billion parameters of GPT-3? Then you might be interested in the 600+ billion of GShard for NMT: Now it's a race to one trillion!

0 replies, 1 likes

Huaiyu Khaw: The GPT-3 paper just landed on ArXiv: 🤯

0 replies, 1 likes

Convaise: With #GPT3, a few-shot learner and one of the largest language models ever trained, @OpenAI sets new standards in multiple #NLP tasks, while falling short on others. We're excited to see how such large models can be used efficiently in practice!

0 replies, 1 likes

J. Harry Caufield: Finally going to try reading that GPT-3 paper

1 replies, 1 likes

Sushant Kumar: @Travpreneur The large chunk of the training data was web corpus.

0 replies, 1 likes

Balazs Tarjan: One of the most exciting results (and maybe the most terrifying) from the new GPT-3 paper ( is that people's ability to identify whether news articles are model-generated decreases to the level of random guessing for the largest model (175B parameters!)

0 replies, 1 likes

Sam Charrington: Language models getting better at writing academic papers

0 replies, 1 likes

Jorge Bravo: Truly impressed by this recent AI breakthrough: a 175-billion parameters NLP model developed by @OpenAI. Huge potential also in the scientific domain!

0 replies, 1 likes

Atis Elsts: GPT-2 had a good run. Now GPT-3 is released. I look forward to being entertained, amazed, and baffled by even higher quality auto-generated writing!

0 replies, 1 likes

David Doswell: @wesyang A natural language processing (NLP) neural network for generating text. It is not “intelligent,” but it can simulate intelligent responses—which is often indistinguishable in practice. Technical paper on the motivations and ideas

1 replies, 1 likes

rohan paul: GPT-3's model is made up of 175 billion parameters For comparison, GPT-2 was 1.5 billion and the pre-GPT-3 largest Transformer-based language model released by Microsoft (Turing NLG) one month earlier was 17 billion parameters #GPT3 #MachineLearning

0 replies, 1 likes

Thomas Miconi: 1- Few-shot learning with zero gradient update is really cool. 2- 175 billions. With a b.

1 replies, 1 likes

Timothy O'Hear: Deep learning models' inability to learn tasks without a large quantity of very specific data is a bit of a myth. But this takes it to a new level. On the graph below: 10^0 means "1" and 10^1 means "10" 😮

0 replies, 1 likes

Dawn Anderson: @bill_slawski @YuriyYarovoy @MordyOberstein And this is probably amongst the only things to read on that:

0 replies, 1 likes

Rafael Cosman: For people that don't know what #GPT3 is, I highly recommend checking it out!

0 replies, 1 likes

Daisuke Okanohara: GPT-3 is the largest non-sparse language model with 175 billion parameters. Without fine-tuning, GPT-3 can solve many NLP tasks to some extent just by adding a few (or zero) examples as additional input and predict the text following the question.

0 replies, 1 likes

Mohamed Omran: In other words: Our 175B-parameter model, which basically memorises the English-speaking Internet, does well on natural language tasks with next to no extra training. Two things I find remarkable here:

1 replies, 0 likes

Vikas V Patil: @OpenAI research on GPT-3 language model is gigantic! It's a great step up in the field of #NLP. It has 175 billion parameters 🤓 -

1 replies, 0 likes

Fabon Dzogang: One in two people will be convinced that #gpt-3 was actually human when reading its fake stories. This is the result of training 175 billion parameters on 300 billion token occurrences. #AI #MachineLearning

1 replies, 0 likes

Paul O: The gender bias of GPT-3 was analyzed. "He was very" / "She was very" was given as a seed & GPT-3 filled in the rest. The GPT-3 AI was trained on nearly a trillion words scraped from the public internet. The result is a disturbing snapshot of the internet.

1 replies, 0 likes
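The probing procedure described above can be sketched as a small harness: feed each seed prompt to a generator many times and tally the first word of each completion. Everything here is illustrative; `generate` is a placeholder for any text-generation call (the stub below stands in so the sketch runs without a real model).

```python
from collections import Counter

def probe_bias(generate, seeds=("He was very", "She was very"), n=50):
    """Tally the first word of `n` completions for each seed prompt."""
    results = {}
    for seed in seeds:
        words = [generate(seed).strip().split()[0] for _ in range(n)]
        results[seed] = Counter(words)
    return results

# Deterministic stub generator, purely for illustration.
def fake_generate(prompt):
    return "happy." if prompt.startswith("He") else "beautiful."

tallies = probe_bias(fake_generate, n=5)
print(tallies["He was very"].most_common(1))
```

Comparing the two Counters word by word is what surfaces skew in the descriptors the model associates with each gendered seed.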


Found on May 31 2020 at

PDF content of a computer science paper: Language Models are Few-Shot Learners