GPU go brrr: Reproducing ResNet50 using Huggingface 🤗 Accelerate & Distributed Training

Name: GPU go brrr: Reproducing ResNet50 using Huggingface 🤗 Accelerate & Distributed Training
Uploaded: Feb 3, 2025
Duration: 8112 s

Priyam Mazumdar
3.65K subscribers

1.3K views

2:15:12

Link to Code: https://github.com/priyammaz/PyTorch-Adventures/tree/main/PyTorch%20Basics/Huggingface%20Accelerate
Github Repo: https://github.com/priyammaz/PyTorch-Adventures

To train large models (think Transformers) we need to use buckets of GPUs! Up until now we have only looked at training on a single GPU, but if we have a bunch of them, we should be able to use them! Today we will look at my favorite way of doing distributed training: Huggingface 🤗 Accelerate. We will cover creating a controllable script through Argparse, the different types of distributed training, the use of learning rate schedulers, how gradient accumulation can simulate larger batch sizes, GPU synchronization, and training checkpointing. Also, as an extra, I love to log everything inside Weights and Biases, so we'll see how to do that using Accelerate as well! This full reproduction will be done on ImageNet, but use any image classification dataset you've got!

**Note** My explanation of Floating Point Precision is not exactly accurate (before all the Computer Engineers get angry!!!) This was just a rough intuition of what BFloat16 is. Here is a helpful article from an actual Computer Engineer if you want to know precisely what's going on: https://medium.com/@furkangozukara/what-is-the-difference-between-fp16-and-bf16-here-a-good-explanation-for-you-d75ac7ec30fa

Timestamps:
00:00:00 Introduction
00:01:35 Why do we need Distributed Training?
00:07:00 Simulating Larger Batch Sizes w/ Gradient Accumulation
00:08:27 Model/Tensor Parallelism
00:13:40 Distributed Data Parallelism (DDP)
00:22:00 Setting up the Accelerate Environment
00:27:00 What is BFloat16?
00:37:45 Argparse
00:42:26 Initializing the Accelerator
00:49:00 TorchMetrics for Accuracy
00:50:10 Everything Runs Multiple Times!
00:53:33 Loading the ResNet50 Model
00:54:50 Training/Testing Transforms
00:58:37 Load ImageNet Dataset/DataLoader
01:04:55 What is Weight Decay?
01:10:20 Optimizer Groups
01:18:50 Learning Rate Schedulers
01:28:00 Prepare Everything for Distributed Training
01:30:00 Resuming from Checkpoint
01:37:10 Starting Training Loop
01:42:10 accelerator.accumulate()
01:45:00 Computing Gradients
01:46:25 Gradient Synchronization
01:53:00 Gathering Across GPUs
01:59:25 Evaluation Loop
02:02:00 Logging to Weights and Biases
02:04:00 Updating the LR Scheduler
02:05:49 Checkpointing our Model
02:07:20 Training w/o DDP
02:09:40 Training w/ DDP
02:12:50 Final Training Results

Socials!
X https://twitter.com/data_adventurer
Instagram https://www.instagram.com/nixielights/
Linkedin https://www.linkedin.com/in/priyammaz/
Discord https://discord.gg/RaguqCTURA
🚀 Github: https://github.com/priyammaz
🌐 Website: https://www.priyammazumdar.com/
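
A quick intuition for the gradient accumulation idea mentioned above: for a mean-reduced loss, averaging the gradients of several equal-sized micro-batches gives the same gradient as one large batch. The snippet below is only a minimal sketch in plain Python, not the code from the video or the repo. The toy model (y = w * x with MSE loss), the data, and the helper names mse_grad and accumulated_grad are all made up for illustration. In a real Accelerate script this bookkeeping is typically handled inside the accelerator.accumulate() context manager covered at 01:42:10.

```python
# Gradient accumulation on a toy 1-D linear model y = w * x with MSE loss.
# Pure Python; no PyTorch or Accelerate needed. All names here are
# hypothetical and exist only for this illustration.

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1/num_steps.

    Assumes len(xs) is divisible by micro_batch_size so that every
    micro-batch has the same size.
    """
    num_steps = len(xs) // micro_batch_size
    total = 0.0
    for step in range(num_steps):
        lo = step * micro_batch_size
        hi = lo + micro_batch_size
        # Scaling by 1/num_steps turns the running sum into an average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / num_steps
    return total


# Made-up data: 8 samples, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

full = mse_grad(w, xs, ys)                                # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch_size=2)   # 4 steps of 2

# The two gradients agree up to floating-point rounding.
print(abs(full - accum) < 1e-9)
```

The 1/num_steps scaling is the important detail: without it the accumulated gradient would be num_steps times too large, which acts like silently multiplying the learning rate.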
