
Deep Dive: Quantizing Large Language Models, part 1

23.4K views
Mar 6, 2024
40:28

Quantization is an excellent technique to compress Large Language Models (LLMs) and accelerate their inference. In this video, we discuss model quantization: we first introduce what it is and build an intuition for rescaling and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch, ZeroQuant, and bitsandbytes.

In part 2 (https://youtu.be/fXBBwCIA0Ds), we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO.

Slides: https://fr.slideshare.net/slideshow/julien-simon-deep-dive-quantizing-llms/270921785

Don't forget to subscribe to be notified of future videos. Follow me on Medium at https://julsimon.medium.com or Substack at https://julsimon.substack.com.

Chapters:
00:00 Introduction
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
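The mapping function and rescaling ideas covered in the video can be sketched in a few lines of plain Python: an affine (asymmetric) mapping from a float range [min, max] to int8 [-128, 127] using a scale and a zero-point, plus the inverse dequantization step. The function names and the toy weight values below are illustrative, not from the video.

```python
# Minimal sketch of affine (asymmetric) int8 quantization:
# real value v maps to q = round(v / scale) + zero_point, clipped to [qmin, qmax].

def quantize(values, qmin=-128, qmax=127):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin)       # float step per integer level
    zero_point = round(qmin - lo / scale)   # integer that represents 0.0
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover approximate floats; rounding error is at most ~scale/2 per element.
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.5, 0.0, 0.25, 1.5]
q, scale, zp = quantize(weights)
approx = dequantize(q, scale, zp)
```

This also hints at the outlier problem discussed at 16:15: a single extreme value stretches [min, max], inflates the scale, and makes every other value coarser, which is why real techniques clip or otherwise handle the input range.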

