Stable Diffusion Part 2: Training the Variational AutoEncoder

435 views
Feb 5, 2026
3:04:57

Today we build the first stage of the Stable Diffusion pipeline: training the Variational AutoEncoder (VAE). The entire idea of Latent Diffusion is to train the final Diffusion model on a latent space, so first we need to train an encoder that can take images (or any data) and project them into that latent space, which is exactly what AutoEncoders do! Unfortunately, VAEs typically have poor reconstruction quality, so we give them a little help with some perceptual losses such as LPIPS and PatchGAN!

Code: https://github.com/priyammaz/Stable-Diffusion-From-Scratch/tree/main
Part 1: Perceptual Losses: https://youtu.be/OEMQtaJ0KpY

Prereqs: I hope you already know about VAEs, GANs, UNETs, Diffusion, and Attention! If you don't, I have videos for all of that here:
- VAE Derivation: https://youtu.be/jJZadDULoH4
- VAE Implementation: https://youtu.be/9NgC0sh9Msc
- Diffusion Derivation: https://youtu.be/dSC9XOPJXK8
- Diffusion Implementation: https://youtu.be/_otkRnYaozY
- GAN Derivation: https://youtu.be/CdMkDeEWOr4
- GAN Implementation: https://youtu.be/R9VOZnKEBE0
- UNET Implementation: https://youtu.be/fBlLsugz6Q8
- Attention: https://youtu.be/JXY5CmiK3LI

Timestamps:
00:00:00 - Introduction
00:03:30 - Download Conceptual Captions (3M)
00:06:00 - Process Data and Convert to a Huggingface Dataset
00:20:00 - Writing the Dataset/Dataloader
00:24:00 - Writing the Collate Function
00:30:00 - Checking out the dataset
00:35:00 - VAE Recap (and limitations)
00:37:00 - Examples With vs Without Perceptual Losses
00:39:00 - Starting the VAE Implementation
00:41:00 - Upsample/Downsample Blocks
00:45:45 - Residual Block (with optional Time Embeddings)
00:57:45 - Attention Block (Self + Cross)
01:25:20 - Encoder/Decoder Blocks
01:30:30 - Attention Residual Block
01:33:15 - Full VAE Encoder/Decoder
01:45:00 - Wrap Encoder/Decoder together
01:46:45 - Complete the VAE Class
01:59:10 - Add in weight init to PatchGAN
02:00:30 - Start the Training Script
02:07:00 - Testing the Model input/output shapes
02:09:30 - Setup LPIPS and PatchGAN
02:14:45 - Setup everything else needed for Training
02:22:30 - Starting the Training loop
02:23:15 - Delayed GAN Training setup
02:27:15 - Writing the Generator Step
02:31:10 - Adaptive Loss Weighting
02:40:30 - Compute total final loss and update generator
02:42:10 - Writing the Discriminator Step
02:43:10 - Hinge Loss for GAN
02:45:30 - Compute total final loss and update discriminator
02:48:00 - Wrap up training with some visualizations
02:54:30 - Debugging/Run training
02:59:40 - Inference VAE and Check Reconstructions
03:04:10 - Next Steps

Socials!
X: https://twitter.com/data_adventurer
Instagram: https://www.instagram.com/nixielights/
Linkedin: https://www.linkedin.com/in/priyammaz/
Discord: https://discord.gg/RaguqCTURA
🚀 Github: https://github.com/priyammaz
🌐 Website: https://www.priyammazumdar.com/
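Two of the trickier pieces covered in the video are the hinge loss for the PatchGAN discriminator and the adaptive weighting that balances the GAN loss against the reconstruction loss (as popularized by VQGAN/Latent Diffusion). A minimal NumPy sketch of both, assuming scalar logits and precomputed gradient norms (the function names, epsilon, and clipping bound here are illustrative, not taken from the repo):

```python
import numpy as np

def hinge_d_loss(real_logits, fake_logits):
    # Discriminator hinge loss: push logits on real images above +1
    # and logits on reconstructions below -1; zero loss once past the margin.
    loss_real = np.mean(np.maximum(0.0, 1.0 - real_logits))
    loss_fake = np.mean(np.maximum(0.0, 1.0 + fake_logits))
    return 0.5 * (loss_real + loss_fake)

def hinge_g_loss(fake_logits):
    # Generator (the VAE decoder) tries to raise the discriminator's
    # score on its reconstructions.
    return -np.mean(fake_logits)

def adaptive_gan_weight(rec_grad_norm, gan_grad_norm, eps=1e-4, max_w=1e4):
    # Scale the GAN loss so its gradient magnitude (w.r.t. the decoder's
    # last layer) roughly matches the reconstruction loss gradient.
    return float(np.clip(rec_grad_norm / (gan_grad_norm + eps), 0.0, max_w))
```

In a real training loop the two gradient norms are taken with respect to the decoder's final layer weights, and the GAN term is typically only enabled after a warm-up period (the "Delayed GAN Training" step in the timestamps).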

