
Build Python Extensions for Apache Arrow Data with nanoarrow

394 views
Aug 13, 2024
31:22

nanoarrow is a fantastic library to use when interacting with Apache Arrow data. Unfortunately, there isn't a lot of literature showing you how to use it, but this video hopes to change that.

In this video, I walk you through a sample Python extension that can sum integers from an Apache Arrow array. To enable this, we will be using the Arrow C Data specification (which nanoarrow wraps) alongside the Arrow PyCapsule Interface. While I wouldn't encourage you to implement this particular sum algorithm in a production setting (off-the-shelf solutions are still faster!), it helps to establish the workflow needed to build such algorithms. By following the steps in this guide and applying them to your own use cases, you can create robust, scalable computations and deploy them against Apache Arrow data.

Many popular dataframe libraries like pandas and polars are already using Apache Arrow, so this is a great way to plug into datasets created with those libraries and extend them with your own computations.

Source code for this project is available at https://github.com/WillAyd/pyarrow_ext

00:00 - Introduction and project setup
03:30 - Meson build configuration
05:52 - Stubbing out the implementation
07:15 - Initial build and IDE support
08:28 - Starting our implementation
09:27 - Validating input types
13:05 - Extracting Python capsules
22:30 - Implementing our sum algorithm
25:44 - Testing it out...
27:56 - Debugging with Address Sanitizer
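
To give a rough feel for what the sum has to do, here is a minimal pure-Python sketch that walks the `struct ArrowArray` layout from the Arrow C Data specification using only `ctypes`. It is not the extension from the video (that one is written in C against nanoarrow); `sum_int64` and `make_int64_array` are hypothetical names made up for this illustration, and the sketch assumes a non-nested int64 array with exactly two buffers (validity bitmap, then data).

```python
import ctypes


class ArrowArray(ctypes.Structure):
    """Mirror of `struct ArrowArray` from the Arrow C Data Interface."""


# Fields are assigned after class creation because the struct refers to itself.
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # function pointer, unused in this sketch
    ("private_data", ctypes.c_void_p),
]


def sum_int64(array):
    """Hypothetical helper: sum the non-null values of an int64 array."""
    if array.n_buffers != 2:
        raise ValueError("expected a validity buffer and a data buffer")
    validity_addr = array.buffers[0]  # None when the pointer is NULL (no nulls)
    data_addr = array.buffers[1]
    validity = None
    if validity_addr:
        validity = ctypes.cast(validity_addr, ctypes.POINTER(ctypes.c_uint8))
    data = ctypes.cast(data_addr, ctypes.POINTER(ctypes.c_int64))

    total = 0
    # The logical array covers [offset, offset + length) within the buffers.
    for i in range(array.offset, array.offset + array.length):
        # Validity bits are packed least-significant-bit first; 1 means valid.
        if validity is None or (validity[i // 8] >> (i % 8)) & 1:
            total += data[i]
    return total


def make_int64_array(values):
    """Hypothetical helper: build an ArrowArray from a list of ints and None."""
    n = len(values)
    bitmap = (ctypes.c_uint8 * ((n + 7) // 8))()
    data = (ctypes.c_int64 * n)()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)
            data[i] = value
    buffers = (ctypes.c_void_p * 2)(
        ctypes.addressof(bitmap), ctypes.addressof(data)
    )
    array = ArrowArray(
        length=n,
        null_count=values.count(None),
        offset=0,
        n_buffers=2,
        n_children=0,
        buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
    )
    # Keep the backing memory alive for as long as the struct is referenced.
    array._keepalive = (bitmap, data, buffers)
    return array


example = make_int64_array([1, None, 3, 4])
print(sum_int64(example))
```

In a real extension the struct would not be built by hand: under the Arrow PyCapsule Interface, a producer's `__arrow_c_array__()` method returns a pair of capsules named "arrow_schema" and "arrow_array", and the consumer should check the schema's format string (int64 is "l") before touching the buffers. The sketch skips that validation and the release callback entirely.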
