Diffusion Transformers (ViT, DiT, MMDiT)
This video covers the Vision Transformer (ViT), the Diffusion Transformer (DiT), and the Multimodal Diffusion Transformer (MMDiT). Together they trace the architectural evolution that let the original Transformer (initially designed for machine translation and language modeling) replace the Convolutional Neural Network (CNN) as the de facto model for vision.

▶️ Companion videos:
- Transformers in language: https://youtu.be/SFi9KsnidNc?si=XmQpBqd0_KH7Vmcl
- Diffusion fundamentals: https://youtu.be/R0uMcXsfo2o?si=LvBqX2-A1wm66iLJ
- How the Transformer replaced CNNs: https://youtu.be/KnCRTP11p5U?si=2RrAya_2LU5I1Ms-

📚 Papers
ViT: https://arxiv.org/abs/2010.11929
DiT: https://arxiv.org/abs/2212.09748
MMDiT: https://arxiv.org/abs/2403.03206
FiLM: https://arxiv.org/abs/1709.07871

My full reading list: https://www.patreon.com/c/JuliaTurc

00:00 Intro
01:13 Transformer recap
02:24 Image classification
03:35 Vision Transformer (ViT)
05:37 Image generation
07:54 Diffusion Transformer (DiT)
10:07 DiT in-context learning
10:38 DiT cross-attention
11:15 DiT adaLN (and FiLM inspiration)
14:26 DiT adaLN-Zero
16:03 Pixart-alpha
16:43 Multimodal Diffusion Transformer (MMDiT)