
Big Bird: Transformers for Longer Sequences

Comments

AK: Big Bird: Transformers for Longer Sequences 🐦 pdf: https://arxiv.org/pdf/2007.14062.pdf abs: https://arxiv.org/abs/2007.14062 https://t.co/XHuvaqPahM

3 replies, 157 likes


Delip Rao: That didn’t take long at all! As predicted in the recent @pagestlabs issue, long-span attention cost for transformer models like GPT-3 and T5 came down from O(N√N) to O(N) in BigBird. Looking forward to these models becoming viable for everyone to build. https://arxiv.org/abs/2007.14062

2 replies, 57 likes


elvis: Big Bird is a transformer-based model that more effectively supports NLP tasks requiring longer contexts. It preserves the theoretical properties of the full-attention model while reducing the attention mechanism's complexity to linear in the number of tokens. https://arxiv.org/abs/2007.14062 https://t.co/HiMEMSlezY

0 replies, 57 likes


elvis: I am noticing a lot of ML papers these days propose learning frameworks that are inspired by graph theory. BigBird, a recent effort to enable Transformers to deal with longer sequences, reduces the complexity of self-attention by exploiting graph theory. https://arxiv.org/abs/2007.14062 https://t.co/UHJKUO0ne4

1 replies, 49 likes


Kevin Lacker: This looks like Google’s best AI text-generator yet. Since there is no public API, it will get 1/1000th the attention that GPT-3 did.

2 replies, 36 likes


Madison May: @arankomatsuzaki Link for the lazy: https://arxiv.org/abs/2007.14062

1 replies, 34 likes


D. Sivakumar: Very nice work from colleagues in my team and in sibling teams, https://arxiv.org/abs/2007.14062

0 replies, 29 likes


Paul Liang: O(n) Transformer attention mechanism for long sequences: State of the art results with theory https://arxiv.org/abs/2007.14062 https://t.co/cRvYmKmSVB

0 replies, 12 likes


akira: https://arxiv.org/abs/2007.14062 A study that makes self-attention more efficient by combining three types of attention: random, local (neighboring tokens only), and global (full attention involving only a few tokens). They show it is SOTA on many NLP tasks and, theoretically, a universal approximator of seq2seq functions and Turing complete. https://t.co/LZCEPTpE4g

0 replies, 7 likes


Machine Learning and NLP: Big Bird: Transformers for Longer Sequences https://arxiv.org/pdf/2007.14062.pdf #NLProc

0 replies, 7 likes


Andrea Volpini: "We believe something like BigBird can be complementary to GPT-3. GPT-3 is still limited to 2048 tokens. We'd like to think that we could generate longer, more coherent stories by using more context" - Philip Pham, one of Google’s Big Bird creators https://arxiv.org/abs/2007.14062 https://t.co/J81jImMp0M

0 replies, 7 likes


Daisuke Okanohara: BigBird is a Transformer that combines sparse random, local window, and global attention, achieving linear complexity while remaining a universal approximator of sequence functions. It achieves new SOTA on NLP tasks and nearly perfect accuracy on promoter region prediction. https://arxiv.org/abs/2007.14062

0 replies, 4 likes
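
The sparse pattern described in the comments above (random + sliding window + global attention) can be sketched as a token-level attention mask. The sketch below is illustrative only: it uses a plain NumPy boolean mask and hypothetical window/random/global sizes, not the block-sparse implementation the paper actually uses.

# Minimal sketch of a BigBird-style sparse attention mask (window + random +
# global). Token-level and NumPy-only for clarity; the window, random, and
# global sizes are illustrative assumptions, not the paper's configuration.
import numpy as np

def bigbird_mask(n, window=3, n_random=3, n_global=2, seed=0):
    """Boolean n x n mask: mask[i, j] = True means query i may attend to key j."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    for i in range(n):
        # Sliding-window attention: each token sees `window` neighbors per side.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
        # Random attention: each token also sees a few randomly chosen keys.
        mask[i, rng.choice(n, size=n_random, replace=False)] = True

    # Global attention: the first n_global tokens (e.g. a [CLS]-like token)
    # attend to everything and are attended to by everything.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

for n in (256, 512, 1024):
    print(n, int(bigbird_mask(n).sum()))  # allowed query-key pairs grow ~linearly in n

Doubling n roughly doubles the number of allowed query-key pairs, which is the linear-complexity behavior the comments refer to.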


arXiv CS-CL: Big Bird: Transformers for Longer Sequences http://arxiv.org/abs/2007.14062

0 replies, 2 likes


Connor Shorten: Big Bird: Transformers for Longer Sequences 📈 "Can we achieve the empirical benefits of a fully quadratic self-attention scheme using fewer inner-products?" "Express all continuous (seq2seq) functions with only O(n)-inner products." https://arxiv.org/pdf/2007.14062.pdf

1 replies, 1 likes
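
The O(n) inner-product claim quoted above can be checked with back-of-the-envelope arithmetic. The per-token budget below (window, random, and global sizes) is a hypothetical illustration, not the paper's exact block-sparse configuration.

# Rough upper bound on query-key inner products, comparing full attention with
# the sparse pattern sketched earlier. Sizes are illustrative assumptions.
def full_products(n):
    return n * n                              # every query scores every key: quadratic

def sparse_products(n, window=3, n_random=3, n_global=2):
    per_token = (2 * window + 1) + n_random   # window + random keys per query
    return n * per_token + 2 * n_global * n   # plus global rows and columns

for n in (1024, 4096, 16384):
    print(n, full_products(n), sparse_products(n))

For n = 4096 this is roughly 16.8 million inner products for full attention versus about 57 thousand for the sparse pattern, and the sparse count scales linearly as n grows.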


Content

Found on Jul 29 2020 at https://arxiv.org/pdf/2007.14062.pdf

PDF content of a computer science paper: Big Bird: Transformers for Longer Sequences