Big Bird: Transformers for Longer Sequences


Delip Rao: That didn’t take long at all! As predicted in the recent @pagestlabs issue, long-span attention cost for transformer models like GPT-3 and T5 came down from O(N√N) to O(N) in BigBird. Looking forward to these models becoming viable for everyone to build.

elvis: Big Bird is a transformer-based model that more effectively supports NLP tasks requiring longer contexts. It satisfies the theoretical properties of the full model while reducing the attention mechanism complexity to linear in # of tokens.

elvis: I am noticing a lot of ML papers these days propose learning frameworks that are inspired by graph theory. BigBird, a recent effort to enable Transformers to deal with longer sequences, reduces the complexity of self-attention by exploiting graph theory.

Kevin Lacker: This looks like Google’s best AI text-generator yet. Since there is no public API, it will get 1/1000th of the attention that GPT-3 gets.

Madison May: @arankomatsuzaki Link for the lazy:

D. Sivakumar: Very nice work from colleagues in my team and in sibling teams,

Paul Liang: O(n) Transformer attention mechanism for long sequences: State of the art results with theory

akira: A study that makes self-attention more efficient by combining three types of attention: random, local (neighboring tokens only), and full (for only some tokens). The authors show that it is SOTA on many NLP tasks and that, theoretically, it is a universal approximator of seq2seq functions and Turing complete.

Machine Learning and NLP: Big Bird: Transformers for Longer Sequences #NLProc

Andrea Volpini: "We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context" - Philip Pham, one of Google’s Big Bird creators

Daisuke Okanohara: BigBird is a Transformer that combines sparse random, local-window, and global attention, achieving linear complexity while remaining a universal approximator of sequence functions. It achieves new SOTA on NLP tasks and nearly perfect accuracy on promoter region prediction.
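
To make the random + local-window + global combination described above concrete, here is a minimal token-level sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation (the paper works with blocks of tokens for hardware efficiency). The function names `bigbird_style_mask` and `count_inner_products`, and the default values for `window`, `num_global`, and `num_random`, are hypothetical choices made for this example.

```python
import random


def bigbird_style_mask(n, window=3, num_global=2, num_random=2, seed=0):
    """Return allowed[i]: the set of key positions that query i attends to.

    Assumptions for illustration only: the first `num_global` tokens are
    the global tokens, and random keys are drawn uniformly per query.
    """
    rng = random.Random(seed)
    half = window // 2
    global_tokens = set(range(min(num_global, n)))
    allowed = []
    for i in range(n):
        if i in global_tokens:
            # Global tokens attend to every position.
            allowed.append(set(range(n)))
            continue
        # Local window around position i, clipped to the sequence bounds.
        keys = set(range(max(0, i - half), min(n, i + half + 1)))
        # Every token also attends to the global tokens.
        keys |= global_tokens
        # Add a few random keys that are not already covered.
        target = len(keys) + min(num_random, n - len(keys))
        while len(keys) < target:
            keys.add(rng.randrange(n))
        allowed.append(keys)
    return allowed


def count_inner_products(allowed):
    """Number of query-key inner products the sparse pattern requires."""
    return sum(len(keys) for keys in allowed)


if __name__ == "__main__":
    for n in (64, 128, 256):
        sparse = count_inner_products(bigbird_style_mask(n))
        print(f"n={n}: sparse={sparse}, full={n * n}")
```

Each non-global query touches at most `window + num_global + num_random` keys, and only `num_global` queries touch all `n` keys, so the total number of inner products grows linearly in `n` rather than quadratically.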

arXiv CS-CL: Big Bird: Transformers for Longer Sequences

Connor Shorten: Big Bird: Transformers for Longer Sequences 📈 "Can we achieve the empirical benefits of a fully quadratic self-attention scheme using fewer inner-products?" "Express all continuous (seq2seq) functions with only O(n)-inner products."

