Inference Engines (Part 1)

19.8K views
Mar 12, 2026
8:36

GTC Sessions:
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s82448/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Deploying AI Agents at Enterprise Scale)
https://www.nvidia.com/gtc/session-catalog/sessions/gtc26-s81558/?ncid=ref-inpa-249-prsp-en-us-1-l33 (Post-Training Nemotron With RL)

NVIDIA 4080 Super Giveaway: https://docs.google.com/forms/d/1K_70PPbO69ygP32h6PwjDmw8pSeUS97Tk82RVUvHBRY/edit?usp=sharing

Inference is an important but rather underappreciated topic, especially given the potential gains in how fast and efficiently we can run the underlying models. As models grow and architectures get more complex, it's important to understand some of the key components involved in actually running these models for inference. How have they changed over the years, and how have advancements in NVMe, PCIe, and HBM affected them? How will SGLang, vLLM, NVIDIA Dynamo, and TensorRT be shaped going forward?

#ai #deeplearning #inference #datacenters

Chapters
00:00 Intro
01:18 Model Parallelism
02:26 MP Benefits
02:41 SLO
04:19 MP Limitations
04:44 Inference Engine
05:30 Batching
06:46 KV Cache
07:34 Part 2?
07:54 GTC 2026
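
The chapters list the KV cache as one of the core inference-engine components but the description doesn't unpack it, so here is a rough, hypothetical Python sketch of the idea (not code from the video; the names KVCache and HEAD_DIM are made up for illustration, and real engines such as vLLM, SGLang, and TensorRT-LLM manage cache memory in paged blocks across many batched sequences):

import numpy as np

HEAD_DIM = 64  # hypothetical per-head dimension, chosen for the example

class KVCache:
    """Accumulates the key/value vectors for one sequence so each new token
    attends against cached history instead of recomputing the whole prompt."""

    def __init__(self):
        self.keys = []    # one (HEAD_DIM,) vector per past token
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        # Scaled dot-product attention of the new query against cached keys.
        K = np.stack(self.keys)             # (seq_len, HEAD_DIM)
        V = np.stack(self.values)           # (seq_len, HEAD_DIM)
        scores = K @ q / np.sqrt(HEAD_DIM)  # (seq_len,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()            # softmax over past positions
        return weights @ V                  # (HEAD_DIM,) context vector

cache = KVCache()
for step in range(4):                       # pretend to decode 4 tokens
    k = v = q = np.random.randn(HEAD_DIM)   # stand-ins for projection outputs
    cache.append(k, v)
    context = cache.attend(q)               # O(seq_len) work per decode step

The point of the cache is that each decode step does O(seq_len) attention work against stored keys and values rather than re-encoding the entire prefix from scratch, which is why KV-cache memory management is central to batching throughput.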
