This video explores the ImageBERT model from Microsoft Research! This is a really interesting combination of vision and language tokens to achieve state of the art results on MSCOCO and Flickr30k image and text retrieval tasks! I hope this video helped you get a better sense of how image and text tokens can be combined in the transformer architecture and how self-attention uses visual tokens to inform the text task output of BERT's masked language modeling!
Paper Links:
ImageBERT: https://arxiv.org/pdf/2001.07966.pdf
Conceptual Captions: https://ai.googleblog.com/2018/09/conceptual-captions-new-dataset-and.html
Thanks for watching! Please Subscribe!