
Open source, open pipelines: Data ingestion with modern data stack

Streamed live on Apr 29, 2025
1:31:30

Data ingestion is the cornerstone of data engineering: it's where every data journey begins. In this hands-on workshop, you'll learn how to move data from anywhere to anywhere using the open-source modern data stack. We'll focus on practical skills, leveraging the Python library dlt (data load tool) to ingest data from a REST API and load it into DuckDB, a fast and lightweight database. Whether you're just getting started with data pipelines or looking to modernize your current stack, this session will give you a solid foundation for building reliable, open-source ingestion workflows. Come ready to write some code, get your hands dirty, and walk away with real-world ingestion superpowers.

Timeline:
00:00:32 - Introduction: PyLadies Amsterdam
00:03:36 - Introduction: Violetta Mishechkina
00:04:28 - Workshop structure
00:05:47 - What is data ingestion? How does a dataset magically appear?
00:07:55 - Why are data pipelines so amazing? Why are data engineers so important?
00:11:42 - Why open source matters in data engineering
00:15:23 - Data ingestion step 1: Extracting data
00:16:22 - Extracting data from APIs
00:17:28 - Extracting data from REST APIs
00:19:42 - Common challenges in extracting data from REST APIs: rate limits, authentication, pagination, memory management
00:29:04 - Extracting data with dlt
00:32:46 - Exercise 1: Extract paginated data from the GitHub API
00:39:42 - Exercise 1: Solution
00:41:30 - Data ingestion step 2: Normalizing data
00:46:59 - Normalizing data with dlt
00:51:53 - Data ingestion step 3: Loading data
00:53:10 - What is DuckDB?
00:58:36 - Why use dlt for loading data?
01:03:04 - Exercise 2: Loading GitHub issues
01:07:57 - Exercise 2: Solution
01:11:12 - Fixing a rate limit error by adding authentication
01:15:08 - Exercise 3: Add authentication to GitHub issues
01:17:14 - Exercise 3: Solution
01:18:59 - Incremental loading
01:19:45 - Incremental loading methods in dlt
01:23:44 - Exercise 4: Load GitHub data into DuckDB with incremental loading
01:23:58 - Exercise 4: Solution
01:27:13 - dlt sources (Where can you ingest data from?) and dlt destinations (Where can you load data to?)
01:28:45 - Closing remarks and PyLadies Amsterdam announcements

GitHub repo: https://github.com/pyladiesams/data-ingestion-modern-stack-apr2025

Speaker: Violetta Mishechkina
https://www.linkedin.com/in/violetta-mishechkina/

Violetta Mishechkina is a Solutions Engineer at dltHub. She has been working in the data field since 2018, with a background in machine learning. Violetta started as a Data Scientist, training ML models and neural networks. A year ago, she joined dltHub's Solutions Engineering team and discovered dlt, a Python library that automates 90% of tedious data engineering tasks. Now she works closely with customers and partners to help them integrate and optimize dlt in production. Violetta also collaborates with her development team as the voice of the customer, ensuring the product meets real-world data engineering needs.
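One of the extraction challenges the workshop calls out (00:19:42) is pagination with bounded memory. The sketch below is a minimal stdlib-only illustration of following "next" links page by page and yielding records lazily; it does not use dlt's actual API (in the workshop, dlt's REST API helpers handle pagination for you), and the fake in-memory API (`FAKE_PAGES`, `fetch_page`) is invented purely for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory "API": three pages of records, mimicking
# page-number pagination with a "next" link (as GitHub's API provides
# via the Link header). Purely illustrative data.
FAKE_PAGES = {
    1: {"items": [{"id": 1}, {"id": 2}], "next": "https://api.example.com/issues?page=2"},
    2: {"items": [{"id": 3}, {"id": 4}], "next": "https://api.example.com/issues?page=3"},
    3: {"items": [{"id": 5}], "next": None},
}

def fetch_page(url):
    """Stand-in for an HTTP GET returning JSON; reads from FAKE_PAGES."""
    page = int(parse_qs(urlparse(url).query)["page"][0])
    return FAKE_PAGES[page]

def paginate(start_url):
    """Yield records one at a time, following 'next' links until exhausted.

    Yielding instead of accumulating a full list keeps memory flat no
    matter how many pages the API returns - the memory-management
    concern the workshop mentions.
    """
    url = start_url
    while url:
        payload = fetch_page(url)
        yield from payload["items"]
        url = payload["next"]

records = list(paginate("https://api.example.com/issues?page=1"))
```

In real code, `fetch_page` would call the API over HTTP (handling rate limits and authentication as well); the generator pattern stays the same.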
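The incremental loading section (01:18:59) covers loading only records that changed since the last run. The core idea can be sketched in plain Python: remember the highest value of a cursor field (here `updated_at`) and skip anything at or below it on the next run. This is a hand-rolled illustration of the concept, not dlt's incremental loading API; the `state` dict and `incremental_filter` helper are invented for this sketch.

```python
# Pipeline "state" persisted between runs; a real tool stores this in
# the destination or a state file. Hypothetical structure.
state = {"last_value": None}

def incremental_filter(records, cursor="updated_at"):
    """Keep only records with a cursor value above the stored maximum,
    then advance the stored maximum. ISO date strings compare correctly
    as plain strings."""
    new = [
        r for r in records
        if state["last_value"] is None or r[cursor] > state["last_value"]
    ]
    if new:
        state["last_value"] = max(r[cursor] for r in new)
    return new

# First run: nothing seen yet, so everything is loaded.
run1 = incremental_filter([
    {"id": 1, "updated_at": "2025-04-01"},
    {"id": 2, "updated_at": "2025-04-10"},
])

# Second run: the unchanged record is skipped, only the newer one loads.
run2 = incremental_filter([
    {"id": 2, "updated_at": "2025-04-10"},
    {"id": 3, "updated_at": "2025-04-20"},
])
```

The same cursor-tracking idea is what dlt's incremental loading manages for you, including persisting the state between pipeline runs.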

