Apache Airflow is the tool of choice for data engineers orchestrating large-scale data pipelines, and it integrates with many tools such as Apache Pig, Apache Hive, Apache Pinot, Google Kubernetes Engine, and Google Dataproc, to name a few.
In this video we'll discuss Airflow's integration with Dataproc and see how to set up a simple workflow that creates a transient cluster, submits a job, and then deletes the cluster.
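The workflow above can be sketched as an Airflow DAG using the Dataproc operators from the Google provider package. This is a minimal illustration, not necessarily the exact DAG from the gist linked below; the project ID, region, machine types, and GCS path are placeholders you would replace with your own values.

```python
# Sketch of a transient-cluster Dataproc workflow in Airflow 2.x.
# Assumes apache-airflow-providers-google is installed; all IDs and
# paths below are placeholders, not values from the video.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "my-gcp-project"      # placeholder
REGION = "us-central1"             # placeholder
CLUSTER_NAME = "transient-cluster"

with DAG(
    dag_id="dataproc_transient_cluster",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Step 1: spin up a small transient cluster.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    )

    # Step 2: submit a PySpark job to that cluster.
    submit_job = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER_NAME},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},  # placeholder
        },
    )

    # Step 3: tear the cluster down. ALL_DONE ensures deletion runs
    # even when the job task fails, so the transient cluster never leaks.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_job >> delete_cluster
```

Setting `trigger_rule=TriggerRule.ALL_DONE` on the delete task is the key design choice for transient clusters: it guarantees cleanup regardless of whether the job succeeded.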
This video is part of the course Apache Spark on Dataproc. You can find all the videos for this course in the following playlist.
https://www.youtube.com/playlist?list=PLeOtIjHQdqvFtYzoFL-DCYx_Sw-5iBJQ4
I regularly blog and post on my other social media channels, so make sure to follow me there as well.
Sample DAG : https://gist.github.com/kaysush/ade06ca3b4f42218f720e92e455c7b7b
PySpark Code : https://gist.github.com/kaysush/65fdd9a5d5bb03a198d8fb1e23125bf1
Medium : https://medium.com/@Sushil_Kumar
Github : https://github.com/kaysush
Linkedin : https://linkedin.com/in/sushilkumar93