I Split LLM Inference Across Two GPUs: Prefill, Decode, and KV Cache

3.5K views
May 8, 2026
27:36

Kimi published a paper splitting LLM inference across two separate data centers, so I tried to reproduce it using my PC and my laptop.

Sign up for my FREE weekly newsletter, where I spill my unfiltered thoughts on the latest AI news, cool research, and projects I'm building: https://www.onchainaigarage.com/
🐦 Follow Tonbi on X for real-time AI x blockchain updates! https://x.com/tonbistudio

Kimi's new "Prefill as a Service" paper shows that prefill and decode have completely different hardware bottlenecks: prefill is compute-bound, decode is memory-bandwidth-bound. Splitting them across separately optimized clusters can cut costs dramatically, and the split is unlocked by a hybrid attention architecture that keeps the KV cache small enough to move over commodity ethernet. Instead of just reading the paper, I recreated the architecture on my desktop RTX 3060 and laptop RTX 4070 using HuggingFace Transformers and raw TCP, ran Zamba 2 split between them, and measured how closely my KV cache numbers matched the paper's predictions. I then added 4-bit KV quantization to compensate for the compression (MLA) I couldn't bake in. (Code sketches for the cache measurement, the TCP hand-off, and the quantization follow below.)

✅ Full concept breakdown: why prefill is compute-bound and decode is memory-bound, the chef analogy, and why Kimi picked H200s for one cluster and H20s for the other.
✅ Validated the paper's KV cache math byte-for-byte on my own hardware: the Mamba state stayed fixed regardless of prompt length, the attention KV grew linearly, and total growth was sub-linear, exactly as predicted.
✅ Honest limits: my gigabit ethernet couldn't match the required bandwidth and the Zamba 2 model lacks MLA compression, so the split ran slower than local inference, but the mechanism itself was confirmed.

💻 Tonbi's GitHub: https://github.com/tonbistudio
🌐 Portfolio: https://www.tonbistudio.com

Resources:
🔗 Kimi Prefill as a Service Paper: https://arxiv.org/html/2604.15039v1
🔗 Zamba 2 Model (Zyphra): https://huggingface.co/Zyphra/Zamba2-2.7B
🔗 HuggingFace Transformers: https://huggingface.co/docs/transformers
🔗 Claude Code: https://claude.ai/claude-code

Timestamps:
0:00 - Intro: Kimi's Prefill as a Service paper
2:19 - Why split an LLM? Prefill vs decode explained
4:56 - Compute vs memory bandwidth (the chef analogy)
7:15 - KV cache and why the transfer normally kills this
9:19 - Hybrid attention: what makes the split feasible
10:52 - The paper's setup, results, and cost savings
13:25 - My experiment: splitting inference across PC and laptop
15:25 - Validating the paper's KV cache predictions byte-for-byte
21:01 - Live run with 4-bit KV quantization to compensate for MLA
23:58 - Switching roles: laptop prefill, desktop decode
26:34 - Wrap up: confirmed mechanism, honest limitations

Coming Next: More research reproductions on consumer hardware, plus continuing the Hermes Agent Master Class and GPU optimization experiments!

👀 Have you experimented with disaggregated inference or tried to reproduce a recent AI paper locally? Drop your results in the comments! If this was interesting, please like, subscribe, and hit the bell for more research reproductions! 🦐✨

#PrefillAsAService #KimiAI #LLMInference #KVCache #Mamba #HybridAttention #ResearchReproduction #ClaudeCode #LocalLLM #GPU #MLEngineering #AIResearch
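If you want to check the sub-linear cache growth yourself, here is a minimal sketch, assuming the Zamba2-2.7B model from the resources list and a generic byte-counter over whatever tensors the hybrid cache happens to hold; the helper and prompt are illustrations, not the code from the video.

```python
# Minimal sketch (not the video's exact code): measure how the cache returned by a
# prefill-only forward pass grows with prompt length on a hybrid Mamba/attention model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Zyphra/Zamba2-2.7B"  # model from the resources list

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def tensor_bytes(obj, seen=None):
    """Recursively sum the bytes of every tensor reachable from obj.
    Kept generic on purpose: the hybrid cache mixes attention K/V tensors with Mamba states."""
    seen = set() if seen is None else seen
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    if isinstance(obj, torch.Tensor):
        return obj.numel() * obj.element_size()
    if isinstance(obj, (list, tuple)):
        return sum(tensor_bytes(x, seen) for x in obj)
    if isinstance(obj, dict):
        return sum(tensor_bytes(x, seen) for x in obj.values())
    if hasattr(obj, "__dict__"):
        return tensor_bytes(vars(obj), seen)
    return 0

prompt = "Explain why prefill is compute-bound and decode is memory-bound. " * 256
for n_tokens in (128, 512, 2048):
    ids = tok(prompt, return_tensors="pt").input_ids[:, :n_tokens].to(model.device)
    with torch.no_grad():
        out = model(ids, use_cache=True)  # prefill only, no decoding
    mb = tensor_bytes(out.past_key_values) / 1e6
    print(f"{n_tokens:5d} prompt tokens -> cache ≈ {mb:.1f} MB")
```

If the paper's claim holds, the printed sizes should grow sub-linearly: the Mamba state contribution stays constant while only the attention layers' K/V scale with prompt length.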
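For the hand-off itself, here is a hypothetical prefill-side sketch in the spirit of the experiment (HuggingFace Transformers plus a raw TCP socket). The host address, port, and the idea of pickling the whole cache object with torch.save are assumptions for illustration, not details confirmed by the video.

```python
# Hypothetical prefill-side sketch (machine A, e.g. the desktop RTX 3060):
# run the prompt through the model once, then ship the prompt ids and the cache
# over a plain TCP socket with a length-prefixed framing.
import io
import socket
import struct

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DECODE_HOST, DECODE_PORT = "192.168.1.50", 9999  # placeholder address of the decode machine
MODEL_ID = "Zyphra/Zamba2-2.7B"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = "Summarize the Prefill as a Service idea in three sentences."
ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)

# Prefill over everything except the last token, so the decode side still has
# one unprocessed token to feed in when it resumes generation.
with torch.no_grad():
    out = model(ids[:, :-1], use_cache=True)

buf = io.BytesIO()
torch.save({"input_ids": ids.cpu(), "cache": out.past_key_values}, buf)  # assumes the cache object pickles cleanly
payload = buf.getvalue()

with socket.create_connection((DECODE_HOST, DECODE_PORT)) as sock:
    sock.sendall(struct.pack("!Q", len(payload)))  # 8-byte big-endian length header
    sock.sendall(payload)
print(f"sent {len(payload) / 1e6:.1f} MB of prompt ids + cache")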
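And a matching decode-side sketch under the same assumptions: it accepts the bytes, reloads the cache onto the local GPU, and lets generate() resume from where the prefill machine stopped. weights_only=False is needed because a full cache object is being unpickled, so only do this over a link you trust.

```python
# Hypothetical decode-side sketch (machine B, e.g. the laptop RTX 4070).
import io
import socket
import struct

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Zyphra/Zamba2-2.7B"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

server = socket.create_server(("0.0.0.0", 9999))
conn, addr = server.accept()

(n_bytes,) = struct.unpack("!Q", conn.recv(8))  # read the length header
chunks, received = [], 0
while received < n_bytes:
    chunk = conn.recv(min(1 << 20, n_bytes - received))
    if not chunk:
        raise ConnectionError("socket closed before the full cache arrived")
    chunks.append(chunk)
    received += len(chunk)

# weights_only=False because a full cache object is unpickled (trusted link only)
state = torch.load(io.BytesIO(b"".join(chunks)), map_location="cuda", weights_only=False)
ids = state["input_ids"].to(model.device)
cache = state["cache"]

# generate() skips the tokens already covered by the cache and only decodes new ones
out_ids = model.generate(ids, past_key_values=cache, max_new_tokens=64)
print(tok.decode(out_ids[0, ids.shape[1]:], skip_special_tokens=True))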
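On the 4-bit step: stock Transformers exposes KV cache quantization through cache_implementation="quantized" with the optimum-quanto backend, which is one way to shrink the cache before pushing it over slow ethernet. Whether this plugs directly into Zamba 2's hybrid cache class is not confirmed by the video, so treat the model id as a placeholder (any attention-based causal LM works) and the snippet as a sketch of the generic API rather than the video's exact approach. It needs optimum-quanto installed.

```python
# Sketch of 4-bit KV cache quantization via the generic Transformers API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Zyphra/Zamba2-2.7B"  # placeholder; swap in any standard attention causal LM if needed

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cuda"
)

ids = tok("Why is decode memory-bandwidth-bound?", return_tensors="pt").input_ids.to(model.device)

# Keys/values are stored in 4 bits instead of 16, roughly a 4x smaller attention cache,
# which is the kind of shrink you want before shipping it across a slow link.
out = model.generate(
    ids,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))
```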

