How Spark Tungsten Turns Plans Into Generated Code
Spark performance is not only a cluster problem. After Catalyst has chosen a plan, every executor still has to run a local row-processing loop, and that loop can waste work through object allocation, pointer chasing, iterator calls, garbage collection, and opaque function boundaries.

This chapter explains Tungsten and whole-stage code generation from the executor's point of view. We cover why generic JVM object shapes hurt analytical workloads, how `UnsafeRow` stores rows as compact binary layouts, why vectorized scans and columnar batches matter before the loop starts, how the volcano model pays for abstraction on every row, how whole-stage codegen fuses compatible operators into one generated function, where fusion breaks, and why UDF boundaries can be expensive even when the function body looks simple.

0:00 The hidden Spark loop
0:36 More machines can hide a bad loop
1:17 After Catalyst
1:49 The JVM object tax
2:35 UnsafeRow and binary rows
3:43 Memory pressure in the hot path
4:16 Columnar batches and vectorized scans
4:58 The volcano model tax
5:43 Whole-stage codegen
6:49 What generated code does
7:25 Where fusion breaks
8:03 How to see it in the plan
9:02 Why UDFs feel expensive
10:03 The tradeoff
10:49 The practical reading habit
11:30 Closing and next topic

Full series playlist (16 chapters): https://www.youtube.com/playlist?list=PLR8zgz1piLsgZqQ-OlCOlAP3BmvFXmbTZ
1. Advanced Spark Ch.1: When More Executors Do Not Make Spark Faster: https://youtu.be/l2FAI-glHsM
2. Advanced Spark Ch.2: From DataFrame Call to Physical Plan: https://youtu.be/I_9dVgtN-68
3. Spark Catalyst: Why Your Query Gets Rewritten: https://youtu.be/w69p_Ax0Ldc
4. Advanced Spark Ch.4: Tungsten and Whole-Stage Codegen (this video): https://youtu.be/RvrvnIKt3M8
5. Shuffle Costs
6. Join Strategies
7. Small Files and Scan Overhead
8. Python Boundaries and UDF Costs
9. Adaptive Query Execution at Runtime
10. Structured Streaming Internals
11. Stateful Streaming and State Stores
12. Spark Connect Architecture
13. DataSource V2 and Lakehouse Connectors
14. Debugging Plans, Stages, and Shuffles
15. Kubernetes and Object Storage
16. The 4.x Roadmap

Subscribe for new chapters weekly.
Subtitles: English
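
Companion sketch: the contrast between the volcano model and a fused loop, as summarized in the chapter description, can be illustrated with a toy example in plain Python. This is not Spark code and not what Tungsten actually generates (Spark emits Java source); the class names `Scan`, `Filter`, `Project` and the two-column row shape are hypothetical, chosen only to show how per-row iterator calls across operator boundaries collapse into one loop when compatible operators are fused.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical row shape for this sketch: (id, amount).
Row = Tuple[int, int]


class Scan:
    """Leaf operator: hands out rows one at a time."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def __iter__(self) -> Iterator[Row]:
        return iter(self.rows)


class Filter:
    """Pulls rows from its child and forwards those matching the predicate."""

    def __init__(self, child: Iterable[Row], pred: Callable[[Row], bool]) -> None:
        self.child = child
        self.pred = pred

    def __iter__(self) -> Iterator[Row]:
        for row in self.child:
            if self.pred(row):  # one opaque call per row
                yield row


class Project:
    """Pulls rows from its child and computes one output value per row."""

    def __init__(self, child: Iterable[Row], fn: Callable[[Row], int]) -> None:
        self.child = child
        self.fn = fn

    def __iter__(self) -> Iterator[int]:
        for row in self.child:
            yield self.fn(row)  # another opaque call per row


def volcano_sum(rows: List[Row]) -> int:
    """Volcano style: every row crosses three operator boundaries."""
    plan = Project(
        Filter(Scan(rows), lambda r: r[1] > 10),
        lambda r: r[1] * 2,
    )
    return sum(plan)


def fused_sum(rows: List[Row]) -> int:
    """Fused style: the same filter and projection inlined into one loop."""
    total = 0
    for _id, amount in rows:
        if amount > 10:
            total += amount * 2
    return total


rows = [(1, 5), (2, 20), (3, 15), (4, 8)]
print(volcano_sum(rows), fused_sum(rows))
```

Both functions compute the same result; the difference is how many iterator and function calls each row passes through on the way. In Spark itself, `df.explain("codegen")` (available since Spark 3.0) prints the generated code for the fused stages of a query plan.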