Parallel table ingestion with a Spark Notebook (PySpark + Threading)

17.3K views
May 6, 2022
12:33

If we want a single Apache Spark notebook to process a list of tables, the code is easy to write. A simple loop over the list, however, runs one table after another (sequentially). If none of the tables are very big, it is quicker to have Spark load them concurrently (in parallel) using multithreading. There are several ways to do this, but I am sharing the easiest one I have found when working with a PySpark notebook in Databricks, Azure Synapse Spark, Jupyter, or Zeppelin.

Written tutorial and links to code: https://dustinvannoy.com/2022/05/06/parallel-ingest-spark-notebook/

More from Dustin:
Website: https://dustinvannoy.com
LinkedIn: https://www.linkedin.com/in/dustinvannoy
Twitter: https://twitter.com/dustinvannoy
Github: https://github.com/datakickstart

CHAPTERS:
0:00 Intro and Use Case
1:05 Code example single thread
4:36 Code example multithreaded
7:15 Demo run - Databricks
8:46 Demo run - Azure Synapse
11:48 Outro
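
For readers who want a feel for the multithreaded approach before watching, here is a minimal sketch using Python's standard `concurrent.futures` thread pool. It is not the code from the video or the linked tutorial. The names `load_table`, `load_tables_parallel`, and `max_workers`, along with the example table names, are assumptions made for illustration. The body of `load_table` is a stand-in so the sketch runs without a Spark session; in a real notebook it would hold the Spark read and write for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_table(table_name):
    """Hypothetical per-table job (stand-in, no Spark needed).

    In a notebook this is where the Spark read and write for a single
    table would go, using the notebook's existing SparkSession.
    """
    return f"loaded {table_name}"


def load_tables_parallel(table_names, load_fn=load_table, max_workers=4):
    """Run load_fn once per table on a thread pool.

    Returns two dicts keyed by table name: successful results and
    exceptions, so one failing table does not stop the others.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit one task per table and remember which future is which.
        futures = {pool.submit(load_fn, name): name for name in table_names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                errors[name] = exc
    return results, errors


# Example table names are made up for this sketch.
results, errors = load_tables_parallel(["customers", "orders", "products"])
print(results)
print(errors)
```

Threads suit this case because each one mostly waits while Spark does the work on the cluster, and Spark can schedule jobs submitted from different driver threads at the same time. A sensible value for `max_workers` depends on cluster size and table sizes, so treat the default of 4 as a placeholder.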