
GPU Pipeline Optimization Explained | Async UDFs, CUDA Streams & Pinned Memory

1.1K views
Sep 10, 2025
18:50

🖥️ Whiteboard Deep Dive into GPU Pipeline Optimization

In this deep dive, Srinu Lade (https://www.linkedin.com/in/srinivas-lade/), a Software Engineer working on Daft's execution engine, breaks down how to optimize GPU pipelines for ML and multimodal data processing. Using architectural diagrams, he explains why sequential CPU→GPU execution creates bottlenecks and how techniques like async UDFs, CUDA streams, and pinned memory unlock parallelism.

What you'll learn:
- How GPU workloads flow: host↔device transfers, VRAM, kernel execution
- Why Python UDFs are a bottleneck, and how async execution improves throughput
- Using CUDA streams to overlap transfers and compute for better utilization
- How GPU internals (H2D/D2H engines + compute units) enable pipeline parallelism
- Reducing OS overhead with pinned memory reuse in PyTorch workflows
- How Daft abstracts these optimizations into a high-level API for data/ML engineers

Our aim is to abstract away these low-level complexities and provide a high-level API in Daft that delivers optimized GPU execution out of the box for ML workloads.

Daft: simple and reliable data processing for any modality and scale.
Explore → https://daft.ai/
Build → https://docs.daft.ai/
Connect → https://www.daft.ai/slack
Contribute → https://github.com/Eventual-Inc/Daft
Learn → https://daft.ai/blog

pip install daft
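To make the async-UDF point concrete, here is a minimal sketch using only Python's standard asyncio library (not Daft's actual UDF API, and with a simulated model call standing in for real GPU inference): a blocking per-row UDF serializes every call, while an async UDF lets the waiting portions of many calls overlap.

```python
# Sketch of why async UDFs improve throughput. All names here are
# illustrative: fake_model_call simulates an inference call that mostly
# waits (on a transfer, a remote service, etc.).
import asyncio
import time

async def fake_model_call(x: int) -> int:
    # Stand-in for a GPU/remote inference call dominated by waiting.
    await asyncio.sleep(0.05)
    return x * 2

async def run_sequential(rows):
    # Blocking-style UDF: one call at a time,
    # total latency ~= n_rows * per-call latency.
    return [await fake_model_call(r) for r in rows]

async def run_async_udf(rows):
    # Async-style UDF: all calls in flight at once,
    # total latency ~= a single per-call latency.
    return list(await asyncio.gather(*(fake_model_call(r) for r in rows)))

def main():
    rows = list(range(8))

    t0 = time.perf_counter()
    seq = asyncio.run(run_sequential(rows))
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    conc = asyncio.run(run_async_udf(rows))
    t_conc = time.perf_counter() - t0

    assert conc == seq == [r * 2 for r in rows]
    print(f"sequential: {t_seq:.2f}s, overlapped: {t_conc:.2f}s")

main()
```

With 8 rows and a 50 ms simulated call, the sequential path takes roughly 8x as long as the overlapped one; the same idea is what lets an async UDF keep the GPU fed instead of idling between Python-level calls.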

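The CUDA-streams idea can also be sketched conceptually with just the standard library: one thread plays the H2D copy engine, another plays the compute engine, and a small bounded queue plays the role of a reused pinned-buffer pool. This is an analogy, not real CUDA; with PyTorch the equivalents would be `tensor.pin_memory()`, `.to(device, non_blocking=True)`, and `torch.cuda.Stream`.

```python
# Conceptual pipeline-parallelism sketch: transfer and compute stages
# overlap because they run on separate engines (here, threads), and the
# bounded queue caps memory like a small pool of reused pinned buffers.
import queue
import threading

NUM_BATCHES = 6
buffers = queue.Queue(maxsize=2)  # "pinned buffer pool": bounded, reused
results = []

def copy_engine():
    # Stage 1: simulated host-to-device transfer for each batch.
    for i in range(NUM_BATCHES):
        batch = [i] * 4        # pretend this was copied into a device buffer
        buffers.put(batch)     # hand off; blocks if compute falls behind
    buffers.put(None)          # sentinel: no more batches

def compute_engine():
    # Stage 2: simulated kernel execution, overlapping with the
    # next batch's transfer on the other thread.
    while (batch := buffers.get()) is not None:
        results.append(sum(batch))

t1 = threading.Thread(target=copy_engine)
t2 = threading.Thread(target=compute_engine)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 4, 8, 12, 16, 20]
```

The bounded queue is the key design choice: it provides backpressure, so the "copy engine" can run at most two batches ahead, exactly the way a double-buffered H2D/compute pipeline reuses a fixed set of pinned buffers instead of allocating per batch.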
