Threaded Tasks in PySpark Jobs

Written by rick-bahague | Published 2019/08/03
Tech Story Tags: pyspark | python | big-data-processing | speed-up-coding | threaded-tasks | latest-tech-stories | threading-tasks | parquet-files

TLDR There are circumstances when tasks (Spark actions, e.g. save, count, etc.) in a PySpark job can be spawned on separate threads. Under the FAIR scheduler, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. An important reminder is to set('spark.scheduler.mode', 'FAIR') when creating the SparkContext.
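As a rough illustration of the idea in the TL;DR, the sketch below runs two Spark actions on separate Python threads with the FAIR scheduler enabled. It is a minimal example, not the author's original code: the app name, the sample DataFrame, and the output path are hypothetical placeholders.

# Minimal sketch (assumed setup, not the author's original job):
# two Spark actions submitted concurrently from separate threads.
import threading
from pyspark.sql import SparkSession

# 'spark.scheduler.mode' must be set before the SparkContext is created.
spark = (SparkSession.builder
         .appName("threaded-actions-example")      # hypothetical app name
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

df = spark.range(0, 10_000_000)  # placeholder DataFrame for illustration

def count_rows():
    # A Spark action (count) submitted from this thread becomes its own job.
    print("count:", df.count())

def save_parquet():
    # Another action (save) running concurrently on a second thread.
    df.write.mode("overwrite").parquet("/tmp/threaded_example.parquet")  # hypothetical path

threads = [threading.Thread(target=count_rows),
           threading.Thread(target=save_parquet)]
for t in threads:
    t.start()
for t in threads:
    t.join()

With FIFO (the default), the second job would only get resources left over by the first; with FAIR, both jobs receive a roughly equal share while they run.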


Written by rick-bahague | Free & Open Source Advocate. Data Geek - Big or Small.