Let's train Vision Language Models (VLMs) from scratch using just text-only LLMs!
This is a video about Multimodal Vision Language Models, in which we take a simple text-only large language model (LLM) and give it vision capabilities. We visually explain the Querying Transformer (Q-Former) introduced in the BLIP-2 paper, cover all the code, and present a thorough step-by-step guide to training these VLMs yourself!

To join our Patreon and support this channel financially, visit: https://www.patreon.com/NeuralBreakdownwithAVB
Members get access to everything behind-the-scenes that goes into producing my videos - including code. Plus, it supports the channel in a big way and helps to pay my bills.

You can read the BLIP-2 paper here: https://paperbreakdown.com/abs/2301.12597
Paper Breakdown makes it way easier to discover Computer Science research, get personalized paper recommendations to study every day, and access a premium collection of tools to study interactively with context-aware AI agents. Get 50% off using code - VLM50

Follow me on X: https://x.com/neural_avb
Git repo: https://github.com/avbiswas/vlm

Attention to Transformers playlist: https://www.youtube.com/playlist?list=PLGXWtN1HUjPfq0MSqD5dX8V7Gx5ow4QYW
Guide to fine-tuning open-source LLMs: https://youtu.be/bZcKYiwtw1I
Multimodal models theory: https://youtu.be/-llkMpNH160
ViT: https://youtu.be/l4KitGnDXxo

Timestamps:
0:00 - Intro
5:45 - Vision Transformers
6:52 - Coding ViT
8:52 - Q-Former models
11:45 - Coding a Q-Former from a BERT
12:36 - Cross Attention in Transformers
17:52 - Coding Q-Formers
21:33 - LoRA finetuning the Language Model
27:12 - Summary

#ai #deeplearning