
Building a Real-Time Inference Stack on AMD Instinct GPUs

May 14, 2026
Duration: 22:15

Speakers:
Gaël Delalleau, Founder and CEO, Kog
Augustin Verneuil, GPU Engineer, Kog

Talk Abstract:
In this talk, we share our vision for real-time generative AI and the techniques we developed to achieve the fastest LLM inference on GPU to date, with a generation speed of 2,500 tokens/s per request. We first showcase our end-to-end stack optimized for minimal latency on AMD hardware, spanning model re-architecting, a single monokernel implementation, and topology-aware algorithms. In the second part, we focus on one of the defining challenges of megakernels: intra-GPU grid synchronization barriers and reduce/gather primitives. Using a chiplet-aware approach grounded in deep hardware insight, we decrease this overhead from 1.5 µs to 600 ns.

Find the resources you need to develop using AMD products: https://www.amd.com/en/developer.html
Join the Developer Community: https://devcommunity.amd.com/
Join the Developer Discord server: https://discord.gg/amd-dev

© 2026 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, EPYC, ROCm, and AMD Instinct and combinations thereof are trademarks of Advanced Micro Devices, Inc.
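The barrier work in the second half of the talk is easiest to picture with a concrete baseline. Below is a minimal sketch, in CUDA syntax (which maps closely to HIP on AMD Instinct GPUs), of the classic software grid barrier built from a global atomic counter: the kind of grid-wide synchronization primitive whose overhead the talk reduces from 1.5 µs to 600 ns. The function and variable names (grid_barrier, g_arrived, g_generation, two_phase_kernel) and the launch shape are illustrative assumptions, not Kog's chiplet-aware implementation, which relies on hardware topology details the abstract does not spell out.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch, not Kog's implementation: a reusable arrival counter
// plus a "generation" flag so the barrier can be entered more than once
// per kernel launch.
__device__ unsigned int g_arrived = 0;
__device__ unsigned int g_generation = 0;

__device__ void grid_barrier(unsigned int num_blocks) {
    __syncthreads();                     // everyone in the block arrives
    __threadfence();                     // publish this block's prior writes
    if (threadIdx.x == 0) {
        unsigned int gen = g_generation;
        if (atomicAdd(&g_arrived, 1u) == num_blocks - 1) {
            // Last block in: reset the counter, then release all waiters.
            g_arrived = 0;
            __threadfence();
            atomicAdd(&g_generation, 1u);
        } else {
            // Spin on the generation flag; atomicAdd(..., 0) forces a
            // coherent read of device memory.
            while (atomicAdd(&g_generation, 0u) == gen) { }
        }
    }
    __syncthreads();                     // release the rest of the block
}

__global__ void two_phase_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;          // phase 1

    grid_barrier(gridDim.x);             // no block starts phase 2 early

    if (i < n) data[i] *= 2.0f;          // phase 2 sees all phase-1 writes
}

int main() {
    // All blocks must be co-resident on the GPU for a software grid barrier
    // to make progress, so the grid is kept deliberately small here.
    const int blocks = 64, threads = 256, n = blocks * threads;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    two_phase_kernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    float first = 0.0f;
    cudaMemcpy(&first, d, sizeof(float), cudaMemcpyDeviceToHost);
    printf("data[0] = %f (expected 2.0)\n", first);
    cudaFree(d);
    return 0;
}

Note the single shared counter: every block's arrival contends on one memory location, and on a chiplet-based GPU some of that traffic crosses chiplet boundaries. A hierarchical scheme (arrive within a chiplet first, then once across chiplets) is the natural direction a chiplet-aware approach suggests, though the talk's exact technique is not described in this abstract.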
