Slides and web pages are available at https://martinisadad.github.io/
You have probably already used Gemini or ChatGPT, whether to help with a work task or to answer an everyday question. Naturally, people have questions, or need answers, that go beyond text and involve images, audio, and video. This brings us to the multimodal LLM (MLLM).
After this video, you should be familiar with the most typical MLLM architecture, its training process, and the philosophy behind it. You should even be ready to experiment with one yourself.
Join this channel to get access to perks:
https://www.youtube.com/channel/UCF94JVXOx8wy-3AN11T1-9A/join
Related Videos:
Transformer Deep Dive: https://youtu.be/TcKJMBZySj0
Vision Transformer Intro: https://youtu.be/BxQep0qdeWA
CLIP Intro: https://youtu.be/-TdDZ6C9rdg
#ai #llm #multimodal #multimodalai
0:00 Intro
1:58 MLLM Architecture
7:17 MLLM Training
9:15 MLLM Challenges
11:02 Example: LLaVA
20:03 Example: BLIP-2
26:16 Future Of MLLM
Multimodal Large Language Model Intro By Google Engineer | LLaVA | BLIP-2