DuckDB & dbt | End-To-End Data Engineering Project (2/3)

20.9K views
Mar 1, 2024
37:24

@mehdio takes you through part 2 of this end-to-end data engineering project series: transforming data using dbt and DuckDB! You'll leverage DuckDB's in-memory capabilities to streamline both your development process and your deployment to the cloud.

Part 3: https://youtu.be/ta_Pzc2EEEo
Part 1: https://www.youtube.com/watch?v=3pLKTmdWDXk

☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck: https://hubs.la/Q02QnFR40

📓 Resources
* GitHub repo of the tutorial: https://github.com/mehd-io/pypi-duck-flow
* Part 1 of the series (ingestion using Python & DuckDB): https://www.youtube.com/watch?v=3pLKTmdWDXk&t
* How to install the DuckDB CLI: https://duckdb.org/docs/installation/index?version=latest&environment=cli&installer=binary&platform=macos
* dbt unit testing package: https://github.com/EqualExperts/dbt-unit-testing
* dbt-duckdb adapter: https://github.com/duckdb/dbt-duckdb

➡️ Follow Us
LinkedIn: https://linkedin.com/company/motherduck
Twitter: https://twitter.com/motherduck
Blog: https://motherduck.com/blog/

0:00 Intro
1:38 Architecture recap
3:32 Coding the transform pipeline using dbt & DuckDB
35:01 Wrapping up & what's next

#duckdb #dataengineering #sql #python

Discover how to build production-ready data pipelines with dbt and DuckDB, breaking free from slow, cloud-dependent development loops. In this practical data engineering project, we analyze Python library download statistics to demonstrate how the `dbt-duckdb` adapter can simplify your architecture, speed up transformations, and enable true unit testing without any cloud dependencies. This tutorial is perfect for developers looking to improve their dbt workflow and overall developer experience. Short illustrative sketches of each step follow below.

Get started by setting up a brand-new dbt project with the `dbt-duckdb` adapter. We'll show you how to configure your profiles for both local development with an embedded DuckDB instance and seamless scaling to the cloud with MotherDuck.

Learn the power of an interactive development loop by using the DuckDB CLI to directly query and explore complex Parquet data in an AWS S3 bucket, letting you understand your data schema before writing a single dbt model.

Follow along as we construct our first dbt model, transforming raw download data into valuable daily aggregates. This section covers essential SQL transformation techniques, including how to parse nested structs, clean data with `CASE` statements, and generate unique IDs using hash functions. We'll then enhance the model with Jinja templating, replacing hardcoded paths with dynamic `source()` references and implementing variable-based date filters for running backfills and incremental loads.

Learn how to write robust dbt unit tests for your data models using the `dbt-unit-testing` package. We demonstrate a powerful local-first testing strategy where you mock your source data with simple SQL `SELECT` statements and define your expected output. The entire process runs instantly on your local machine with DuckDB, providing a fast feedback loop and enabling true CI/CD for your data pipelines without relying on a cloud data warehouse.

Finally, we cover deployment strategies for both local processing and cloud scaling. Learn how to use dbt macros and DuckDB's `COPY` command to write partitioned Parquet files back to AWS S3. We'll also configure the project to target MotherDuck, showing you how to manage credentials, leverage powerful incremental models, and push your transformed data to a scalable, serverless cloud data warehouse, ready for BI and analytics.
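As a taste of the setup step, here's a minimal `profiles.yml` sketch for the `dbt-duckdb` adapter, with one target for an embedded local DuckDB file and one for MotherDuck. The profile, file, and database names here are hypothetical; check the GitHub repo above for the real configuration.

```yaml
# profiles.yml — minimal sketch; profile and database names are made up
pypi_duck_flow:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: local_dev.duckdb      # embedded, file-based DuckDB for local runs
    prod:
      type: duckdb
      path: "md:pypi_analytics"   # the "md:" prefix targets a MotherDuck database
      # MotherDuck authenticates via the motherduck_token environment variable
```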
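The interactive exploration step might look like the following in the DuckDB CLI. The bucket and path are placeholders, not the tutorial's actual location; `httpfs` gives DuckDB native S3 reads, and `DESCRIBE` reveals the Parquet schema, nested structs included.

```sql
-- In the duckdb CLI: explore remote Parquet before writing any dbt model.
-- Assumes AWS credentials are available (e.g. via environment variables).
INSTALL httpfs;
LOAD httpfs;

-- Peek at a few rows, then inspect the schema (including nested structs)
SELECT * FROM read_parquet('s3://my-bucket/pypi_file_downloads/*.parquet') LIMIT 5;
DESCRIBE SELECT * FROM read_parquet('s3://my-bucket/pypi_file_downloads/*.parquet');
```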
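A first daily-aggregate model along the lines described could look like this sketch. The column names are assumed from the public PyPI downloads schema and the path is a placeholder, so treat this as an illustration rather than the tutorial's exact model.

```sql
-- models/pypi_daily_stats.sql (hypothetical file name)
select
    -- hash function builds a deterministic unique id per (project, day)
    md5(concat(project, '-', date_trunc('day', "timestamp"))) as id,
    date_trunc('day', "timestamp")::date as download_date,
    project,
    details.installer.name as installer,   -- pull a field out of a nested struct
    case
        when details.python like '3.%' then details.python
        else 'other'
    end as python_version,                 -- clean messy values with CASE
    count(*) as daily_downloads
from read_parquet('s3://my-bucket/pypi_file_downloads/*.parquet')
group by all                               -- DuckDB shorthand for all non-aggregates
```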
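The Jinja enhancement then swaps the hardcoded path for a `source()` reference and adds `var()`-driven date bounds. A simplified sketch, assuming a `pypi.file_downloads` source defined in the project:

```sql
-- Same idea, now using a dbt source and variable-based date filters
select
    date_trunc('day', "timestamp")::date as download_date,
    project,
    count(*) as daily_downloads
from {{ source('pypi', 'file_downloads') }}       -- resolved from the sources YAML
where "timestamp" >= '{{ var("start_date") }}'    -- supplied at run time
  and "timestamp" < '{{ var("end_date") }}'
group by all
```

With this in place, a backfill is just a matter of passing different dates, e.g. `dbt run --vars '{"start_date": "2023-04-01", "end_date": "2023-04-02"}'`.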
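For the testing section, the `dbt-unit-testing` package (linked above) lets you mock sources with plain `SELECT` statements. A sketch against the simplified model above; note the package expects models to route `ref()`/`source()` calls through its own macros (or its override flag), so see its README for the wiring details.

```sql
-- tests/unit/test_pypi_daily_stats.sql (hypothetical) — runs via dbt test
{{ config(tags=['unit-test']) }}

{% call dbt_unit_testing.test('pypi_daily_stats', 'counts two downloads on the same day') %}

  {% call dbt_unit_testing.mock_source('pypi', 'file_downloads') %}
    select timestamp '2023-04-01 10:00:00' as "timestamp", 'duckdb' as project
    union all
    select timestamp '2023-04-01 11:00:00', 'duckdb'
  {% endcall %}

  {% call dbt_unit_testing.expect() %}
    select date '2023-04-01' as download_date, 'duckdb' as project, 2 as daily_downloads
  {% endcall %}

{% endcall %}
```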
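For the local deployment path, a small macro can wrap DuckDB's `COPY` command to write partitioned Parquet back to S3. The macro name, partition column, and output path here are made up for illustration:

```sql
-- macros/export_partitioned_parquet.sql (hypothetical name)
{% macro export_partitioned_parquet(relation, s3_path) %}
  copy (select * from {{ relation }})
  to '{{ s3_path }}'
  (format parquet, partition_by (download_date), overwrite_or_ignore true)
{% endmacro %}
```

One way to trigger it is as a post-hook on the model, e.g. `{{ config(post_hook="{{ export_partitioned_parquet(this, 's3://my-bucket/output') }}") }}`.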
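On the MotherDuck side, the incremental pattern mentioned above is standard dbt. A minimal sketch, reusing the hypothetical source from earlier:

```sql
-- Incremental model sketch for the MotherDuck target
{{ config(materialized='incremental', unique_key=['download_date', 'project']) }}

select
    date_trunc('day', "timestamp")::date as download_date,
    project,
    count(*) as daily_downloads
from {{ source('pypi', 'file_downloads') }}
{% if is_incremental() %}
  -- only pull days newer than what's already in the warehouse
  where "timestamp" > (select max(download_date) from {{ this }})
{% endif %}
group by all
```

Pointed at the MotherDuck profile (e.g. `dbt run --target prod`), each run appends only the new days.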
Watch with full transcript & resources: https://motherduck.com/videos/duckdb-dbt-end-to-end-data-engineering-project-23/
