
Multimodal Large Language Model Intro By Google Engineer | LLaVA | BLIP-2

2.2K views
Jun 21, 2025
28:51

Slides and web pages are available at https://martinisadad.github.io/

You have probably already used Gemini or ChatGPT, whether to help with a work task or to answer an everyday question. Naturally, people also have questions, or need answers, that go beyond text to include images, audio, and video. This brings us to the multimodal LLM (MLLM). After this video, you should be familiar with the most typical MLLM architecture, its training process, and the philosophy behind it. You should even be ready to experiment with one yourself.

Join this channel to get access to perks: https://www.youtube.com/channel/UCF94JVXOx8wy-3AN11T1-9A/join

Related videos:
Transformer Deep Dive: https://youtu.be/TcKJMBZySj0
Vision Transformer Intro: https://youtu.be/BxQep0qdeWA
CLIP Intro: https://youtu.be/-TdDZ6C9rdg

#ai #llm #multimodal #multimodalai

0:00 Intro
1:58 MLLM Architecture
7:17 MLLM Training
9:15 MLLM Challenges
11:02 Example: LLaVA
20:03 Example: BLIP-2
26:16 Future Of MLLM

