Iceberg I/O Optimizations in Compute Engines
#icebergSummit 2025 breakout session delivered by Isaac Warren and Srinivas Lade from #BodoAI.

Session Description: Efficient I/O is critical for high-performance analytics, and #ApacheIceberg offers powerful capabilities to optimize data access in compute engines. In this talk, we provide an overview of key I/O optimizations for Iceberg and discuss their implementation in Bodo's open source compute engine for Python and SQL, along with potential additions to the Iceberg standard to enable further optimizations. This gives data practitioners concrete ways to take advantage of Iceberg I/O optimizations more effectively.

First, we address filter pushdown in Bodo's Python compiler and SQL planner. We discuss how the Bodo compiler ensures that only Iceberg-compatible filters are pushed to I/O, which is especially challenging in Python. We also examine potential improvements to the Iceberg standard to support additional filters, such as substring operations, at the Parquet row-group level.

Next, we discuss runtime join filters, which are created by analyzing the build side of join operations to exclude values that are guaranteed not to match. We explain how the Bodo compute engine pushes these filters all the way down to Iceberg data. We also propose a novel enhancement: storing additional metadata in Puffin files. By storing distinct values for low-NDV columns, we can prune unmatched values early in join processing, reducing unnecessary data scans.

Next, we examine metadata file prefetching, a technique currently used in Bodo's reader for Snowflake-managed Iceberg tables but with potential applications to other catalogs. We discuss challenges related to asynchronous APIs in Python and explore possible solutions, including a C++-based implementation that bypasses Python's limitations.

We then discuss general file planning optimizations, where Iceberg metadata can be used to determine row counts, reducing the need for expensive row-group and row filtering.
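The runtime join filters described above can be illustrated with a minimal sketch: the build side of a hash join is scanned first, and the distinct join-key values (plus min/max bounds) it produces become a filter that prunes probe-side rows before the join runs. All function and column names here are illustrative, not Bodo's actual API; a real engine would push the min/max bounds against Iceberg and Parquet column statistics rather than filter row by row.

```python
def build_runtime_join_filter(build_rows, key):
    """Collect distinct key values and min/max bounds from the join's build side."""
    values = {row[key] for row in build_rows if row[key] is not None}
    return {
        "in_set": values,   # exact pruning, practical for low-NDV keys
        "min": min(values), # range bounds usable against column statistics
        "max": max(values),
    }

def probe_side_scan(probe_rows, key, rt_filter):
    """Apply the runtime filter during the probe-side scan, skipping rows
    that are guaranteed not to match any build-side row."""
    for row in probe_rows:
        v = row[key]
        if v is None or v < rt_filter["min"] or v > rt_filter["max"]:
            continue  # pruned by range check alone
        if v in rt_filter["in_set"]:
            yield row

# Toy example: only probe rows with id 2 or 5 can possibly join.
build = [{"id": 2}, {"id": 5}, {"id": 5}]
probe = [{"id": i, "x": i * 10} for i in range(10)]
rt_filter = build_runtime_join_filter(build, "id")
matched = list(probe_side_scan(probe, "id", rt_filter))
```

The same idea motivates the Puffin proposal: if distinct values for a low-NDV column were already stored in table metadata, this pruning could happen at planning time instead of per row.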
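The metadata prefetching idea can be sketched with asyncio: instead of reading metadata files one at a time as the planner discovers them, all known paths are fetched concurrently so their latencies overlap. The fetch here is simulated with `asyncio.sleep`; in a real reader it would be an object-store GET for the table metadata, manifest list, and manifests. File names are illustrative only.

```python
import asyncio

async def fetch_metadata(path: str) -> bytes:
    # Stand-in for network latency of an object-store read.
    await asyncio.sleep(0.01)
    return f"contents-of-{path}".encode()

async def prefetch_all(paths):
    # Issue all requests up front; asyncio.gather overlaps their waits.
    blobs = await asyncio.gather(*(fetch_metadata(p) for p in paths))
    return dict(zip(paths, blobs))

paths = ["metadata.json", "manifest-list.avro", "manifest-0.avro"]
blobs = asyncio.run(prefetch_all(paths))
```

As the abstract notes, fitting this into a synchronous reader is awkward in Python, which is one motivation for moving the prefetch logic into C++.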
Additionally, we highlight possible enhancements to Bodo and PyIceberg that improve filter evaluation at the metadata level, simplifying predicates before applying them to Parquet files. The Iceberg standard could also store Parquet footers in metadata, allowing more efficient planning of filtered files with a single metadata read. Through these optimizations and proposed enhancements, we show how compute engines can maximize Iceberg's efficiency, reduce I/O overhead, and accelerate query performance. Attendees will gain insights into both practical implementations and future opportunities for extending Iceberg's capabilities.
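Metadata-level predicate simplification can be sketched as follows: before opening a Parquet file, a comparison predicate is checked against the file's per-column min/max statistics (as recorded in Iceberg manifests) and resolved to always-true or always-false where possible, so row-level evaluation is skipped entirely. The function and the stats layout are hypothetical, assumed for illustration.

```python
ALWAYS_TRUE, ALWAYS_FALSE, MUST_EVALUATE = "always_true", "always_false", "must_evaluate"

def simplify_gt(column, literal, file_stats):
    """Simplify `column > literal` using a file's min/max column statistics."""
    lo, hi = file_stats[column]["min"], file_stats[column]["max"]
    if lo > literal:
        return ALWAYS_TRUE    # every row in the file satisfies the predicate
    if hi <= literal:
        return ALWAYS_FALSE   # the whole file can be pruned without reading it
    return MUST_EVALUATE      # stats are inconclusive; fall back to row filtering

# Example: a file whose `ts` column spans [100, 200].
stats = {"ts": {"min": 100, "max": 200}}
```

Only the `MUST_EVALUATE` case requires touching the data file, which is why storing richer statistics (or even Parquet footers) in Iceberg metadata would let planners resolve more predicates with a single metadata read.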