
Big Bird: Transformers for Longer Sequences


AK: Big Bird: Transformers for Longer Sequences 🐦 pdf: abs:

3 replies, 157 likes

Delip Rao: That didn’t take long at all! As predicted in the recent @pagestlabs issue, long-span attention cost for transformer models like GPT-3 and T5 came down from O(N√N) to O(N) in BigBird. Looking forward to these models becoming viable for everyone to build.

2 replies, 57 likes

elvis: Big Bird is a transformer-based model that more effectively supports NLP tasks requiring longer contexts. It satisfies the theoretical properties of the full model while reducing the attention mechanism complexity to linear in # of tokens.

0 replies, 57 likes

elvis: I am noticing a lot of ML papers these days propose learning frameworks that are inspired by graph theory. BigBird, a recent effort to enable Transformers to deal with longer sequences, reduces the complexity of self-attention by exploiting graph theory.

1 replies, 49 likes

Kevin Lacker: This looks like Google’s best AI text-generator yet. Since there is no public API, it will get 1/1000th of the attention GPT-3 did.

2 replies, 36 likes

Madison May: @arankomatsuzaki Link for the lazy:

1 replies, 34 likes

D. Sivakumar: Very nice work from colleagues in my team and in sibling teams.

0 replies, 29 likes

Paul Liang: O(n) Transformer attention mechanism for long sequences: State of the art results with theory

0 replies, 12 likes

akira: A study that makes self-attention more efficient. It combines three types of attention: random, local window, and global (full attention for only a few tokens). They show it is SOTA on many NLP tasks, and theoretically it is a universal approximator of seq2seq functions and is Turing complete.

0 replies, 7 likes

Machine Learning and NLP: Big Bird: Transformers for Longer Sequences #NLProc

0 replies, 7 likes

Andrea Volpini: "We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context" - Philip Pham one of Google’s Big Bird creators

0 replies, 7 likes

Daisuke Okanohara: BigBird is a Transformer that combines sparse random, local window, and global attention, achieving linear complexity while remaining a universal approximator of sequence functions. It achieves new SOTA on NLP tasks and nearly perfect accuracy on promoter region prediction.

0 replies, 4 likes
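The attention pattern the tweets above describe (random + local window + global) can be sketched as a token-level boolean mask. This is an illustrative sketch, not the paper's blocked implementation; the function name `bigbird_style_mask` and the parameters `window`, `num_global`, and `num_random` are made up for this example:

```python
import numpy as np

def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """Token-level sketch of a BigBird-style sparse attention mask.
    mask[i, j] = True means query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    # Local sliding window: each token attends to `window` neighbors on each side.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True

    # Global tokens (CLS-like): they attend to everything and everything attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True

    # Random attention: each query additionally attends to a few random keys.
    for i in range(seq_len):
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    return mask

mask = bigbird_style_mask(16)
```

Each row of the mask has only O(window + num_global + num_random) active entries, so the total number of inner products grows linearly with sequence length instead of quadratically, which is the complexity reduction the replies refer to.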

arXiv CS-CL: Big Bird: Transformers for Longer Sequences

0 replies, 2 likes

Connor Shorten: Big Bird: Transformers for Longer Sequences 📈 "Can we achieve the empirical benefits of a fully quadratic self-attention scheme using fewer inner-products?" "Express all continuous (seq2seq) functions with only O(n)-inner products."

1 replies, 1 likes


Found on Jul 29 2020 at
