
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Comments

Aug 07 2019 Devi Parikh

Presenting ViLBERT! It learns visiolinguistic representations that transfer well. SOTA on VQA, captioning, referring expressions, visual commonsense reasoning -- all with minor additions to the base architecture. https://arxiv.org/pdf/1908.02265.pdf Work led by @jiasenlu and @stefmlee. https://t.co/0NaeIfBw7Z
6 replies, 350 likes


Aug 22 2019 Devi Parikh

ViLBERT code is now available! Check it out, and feedback is very welcome. https://github.com/jiasenlu/vilbert_beta
0 replies, 109 likes


Aug 07 2019 Miles Brundage

"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks," Lu et al.: https://arxiv.org/abs/1908.02265 "Our work represents a shift ... towards treating visual grounding as a pretrainable and transferable capability."
1 replies, 103 likes


Aug 13 2019 Paul Liang

some exciting recent work in self-supervised multimodal learning including VideoBERT (https://arxiv.org/abs/1904.01766), ViLBERT (https://arxiv.org/abs/1908.02265), and VisualBERT (https://arxiv.org/abs/1908.03557). for more papers in multimodal representation learning, check out https://github.com/pliang279/awesome-multimodal-ml https://t.co/8tCQ0Gg5Qo
0 replies, 84 likes


Aug 08 2019 Natasha Jaques

https://t.co/Y2aeYXCRKv
0 replies, 24 likes


Aug 08 2019 Victor Dibia

They show that learning joint representations of image content and natural language using a BERT-based architecture (a multi-modal two-stream model with co-attentional transformer layers) improves performance on image-and-language tasks such as visual question answering, etc. https://t.co/gdqmHxVayX
1 replies, 6 likes
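
A rough sketch of the co-attentional transformer layer idea mentioned in the tweet above, in PyTorch: each stream uses its own queries against the other stream's keys and values, so visual regions attend to word pieces and vice versa. This is an illustrative toy block, not the authors' implementation; the class name, dimensions, and residual/normalization details are assumptions.

# Hypothetical co-attention block in the spirit of ViLBERT's two-stream design.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Queries come from one modality, keys/values from the other.
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (batch, num_regions, dim) image-region features
        # txt: (batch, num_tokens,  dim) word-piece features
        v_out, _ = self.vis_attends_txt(vis, txt, txt)  # visual queries, linguistic keys/values
        t_out, _ = self.txt_attends_vis(txt, vis, vis)  # linguistic queries, visual keys/values
        return self.norm_v(vis + v_out), self.norm_t(txt + t_out)

# Toy usage with random features (e.g. 36 detected regions, 20 word pieces).
vis = torch.randn(2, 36, 768)
txt = torch.randn(2, 20, 768)
v, t = CoAttentionBlock()(vis, txt)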


Aug 07 2019 Loren Lugosch

It’s impossible to learn language just by reading lots and lots of language; you need to connect it to stuff perceived in the world. That’s why the GPT-2 model could write a big, coherent article about unicorns, but described a unicorn with 4 horns. Nice to see work like this.
1 replies, 6 likes


Aug 07 2019 小猫遊りょう(たかにゃし・りょう)

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks https://arxiv.org/abs/1908.02265 “We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language.”
1 replies, 1 likes

