GTC Sessions:
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s82448/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Deploying AI Agents at Enterprise Scale)
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81558/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Post-Training Nemotron With RL)
NVIDIA 4080 Super Giveaway:
https://docs.google.com/forms/d/1K_70PPbO69ygP32h6PwjDmw8pSeUS97Tk82RVUvHBRY/edit?usp=sharing
Inference is an important but rather underappreciated topic, especially given the potential gains in how fast and efficiently we can run the underlying models. As models grow and architectures become more complex, it's important to understand the key components involved in actually running these models for inference.
How have these components changed over the years? How have advancements in NVMe, PCIe, and HBM affected them? And how will SGLang, vLLM, NVIDIA Dynamo, and TensorRT be shaped going forward?
#ai #deeplearning #inference #datacenters
Chapters
00:00 Intro
01:18 Model Parallelism
02:26 MP Benefits
02:41 SLO
04:19 MP Limitations
04:44 Inference Engine
05:30 Batching
06:46 KV Cache
07:34 Part 2?
07:54 GTC 2026