🖥️ Whiteboard Deep Dive into GPU Pipeline Optimization
In this deep dive, Srinu Lade https://www.linkedin.com/in/srinivas-lade/ (Software Engineer working on Daft’s execution engine) breaks down how to optimize GPU pipelines for ML and multimodal data processing. Using architectural diagrams, he explains why sequential CPU→GPU execution creates bottlenecks and how techniques like async UDFs, CUDA streams, and pinned memory unlock parallelism.
What you’ll learn:
- How GPU workloads flow: host↔device transfers, VRAM, kernel execution
- Why Python UDFs are a bottleneck — and how async execution improves throughput
- Using CUDA streams to overlap transfers and compute for better utilization
- How GPU internals (H2D/D2H engines + compute units) enable pipeline parallelism
- Reducing OS overhead with pinned memory reuse in PyTorch workflows
- How Daft abstracts these optimizations into a high-level API for data/ML engineers
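To make the async-UDF idea concrete, here is a minimal, stdlib-only sketch (not Daft's actual API) of why asynchronous execution improves throughput: while the "GPU" is busy with batch i, the transfer of batch i+1 can already be in flight. Stage names and durations are made up; `asyncio.sleep` stands in for real asynchronous transfers and kernels.

```python
import asyncio

events = []  # interleaving log, used to show the overlap between stages

async def transfer(batch_id: int) -> int:
    events.append(f"transfer_start:{batch_id}")
    await asyncio.sleep(0.01)  # stand-in for an async host-to-device copy
    events.append(f"transfer_done:{batch_id}")
    return batch_id

async def compute(batch_id: int) -> None:
    events.append(f"compute_start:{batch_id}")
    await asyncio.sleep(0.01)  # stand-in for kernel execution on the device
    events.append(f"compute_done:{batch_id}")

async def pipeline(n_batches: int = 3) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=1)  # bounded handoff buffer

    async def producer() -> None:
        for i in range(n_batches):
            await queue.put(await transfer(i))
        await queue.put(None)  # sentinel: no more batches

    async def consumer() -> None:
        while (batch := await queue.get()) is not None:
            await compute(batch)

    # Running both stages concurrently is what a sequential CPU->GPU loop lacks
    await asyncio.gather(producer(), consumer())

asyncio.run(pipeline())
```

In the sequential version, `transfer_start:1` could only appear after `compute_done:0`; here the event log shows the next transfer starting while the previous compute is still running, which is the same overlap CUDA streams provide at the hardware level.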
Our aim is to abstract away these low-level details so that Daft's high-level API delivers optimized GPU execution out of the box for ML workloads.
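As a rough illustration of the pinned-memory point: the win comes from allocating (and page-locking) a staging buffer once and reusing it, rather than paying that OS cost per batch. The stdlib sketch below shows only the reuse pattern, with `bytearray` standing in for a pinned host buffer; in real PyTorch code this corresponds to one `torch.empty(..., pin_memory=True)` buffer filled via `Tensor.copy_()` before a `non_blocking=True` `.to("cuda")`. Buffer sizes here are assumptions.

```python
BATCH_BYTES = 1 << 20  # assumed 1 MiB upper bound per batch

def stage_naive(batches):
    """Allocate a fresh staging buffer for every batch (the slow pattern)."""
    staged = []
    for b in batches:
        buf = bytearray(len(b))   # new allocation on every iteration
        buf[: len(b)] = b
        staged.append(bytes(buf))
    return staged

def stage_reused(batches):
    """Reuse one preallocated staging buffer across batches (the fast pattern)."""
    staging = bytearray(BATCH_BYTES)  # allocated once, like a pinned buffer
    view = memoryview(staging)
    staged = []
    for b in batches:
        view[: len(b)] = b        # copy into the same buffer each iteration
        # ...a real non-blocking H2D copy from `staging` would start here...
        staged.append(bytes(view[: len(b)]))
    return staged
```

Both functions produce the same results; the reused version simply avoids repeated allocation, which is the overhead pinned-memory reuse eliminates in PyTorch dataloading.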
—
Daft. Simple and reliable data processing for any modality and scale.
Explore → https://daft.ai/
Build → https://docs.daft.ai/
Connect → https://www.daft.ai/slack
Contribute → https://github.com/Eventual-Inc/Daft
Learn → https://daft.ai/blog
pip install daft