Quantization is an excellent technique to compress Large Language Models (LLMs) and accelerate their inference.
In this video, we discuss model quantization: first, what it is, and how to build an intuition for rescaling and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes.
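As a quick preview of the rescaling and mapping-function chapters, here is a minimal sketch of affine (asymmetric) int8 quantization in plain Python. The helper names and the example values are illustrative, not taken from the video.

```python
def affine_quant_params(xmin, xmax, qmin=-128, qmax=127):
    """Compute the scale and zero-point mapping [xmin, xmax] onto the int8 range."""
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = round(qmin - xmin / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Rescale, round to the nearest integer, and clamp to the int8 range.
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    # Map the integer back to an approximation of the original float.
    return scale * (q - zero_point)

# Example tensor: the outlier at 3.0 stretches the range and coarsens the
# step size for all other values, which is the problem the video discusses.
xs = [-1.8, -0.5, 0.0, 1.2, 3.0]
scale, zp = affine_quant_params(min(xs), max(xs))
for x in xs:
    q = quantize(x, scale, zp)
    print(f"{x:5.2f} -> {q:4d} -> {dequantize(q, scale, zp):6.3f}")
```

Every round trip lands within half a quantization step of the original value; shrinking the input range (for example, by clipping outliers) shrinks that step.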
In part 2 https://youtu.be/fXBBwCIA0Ds, we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO.
Slides: https://fr.slideshare.net/slideshow/julien-simon-deep-dive-quantizing-llms/270921785
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com. ⭐️⭐️⭐️
00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
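To accompany the 26:00 chapter, here is a minimal sketch of dynamic post-training quantization using PyTorch's `torch.ao.quantization.quantize_dynamic` API. The toy model is hypothetical and only serves to show the call; dynamic quantization converts weights to int8 ahead of time and quantizes activations on the fly, so no calibration data is needed.

```python
import torch
import torch.nn as nn

# Toy model (illustrative only): dynamic quantization targets the Linear layers.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(16, 32)
        self.fc2 = nn.Linear(32, 4)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyModel().eval()

# One call replaces the listed module types with int8 dynamic equivalents.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 16)
print(qmodel(x).shape)  # same output shape, smaller int8 weights
```

Because activation ranges are computed per batch at runtime, this is the easiest quantization mode to apply, at the cost of some inference-time overhead compared to static quantization.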