We just released a new open source boilerplate template to help you (any Spark user) run spark-submit commands smoothly, taking care of packaging your project source code, third-party dependencies and more.
TL;DR: Here is an open source template to help you get started.
At Soluto, as part of our Data Scientists' day-to-day work, we create ETL (Extract, Transform, Load) jobs. Our main tool for this is Spark, specifically PySpark, together with spark-submit.
Spark is a framework for distributed computing on large-scale datasets, and spark-submit is the command you use to launch your application on the cluster.
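For example, launching a PySpark job on a cluster typically looks something like this (the master, file names and paths shown here are illustrative; --py-files is how you ship extra Python code, such as zipped dependencies, to the executors):

```
spark-submit \
  --master yarn \
  --py-files dependencies.zip \
  main.py
```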
We run many jobs like this daily at Soluto.
Some of the basic needs when using Spark for ETL jobs:

- Creating a Spark context and SQL context
- Handling simple command-line arguments
- Loading all your dependencies: your project source code and third-party requirements
- Running the job both with spark-submit and in the interactive shell

We created a simple template that covers all of the above and can help you get started running ETL jobs using PySpark.
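To make this concrete, here is a minimal sketch of what a job built on such a template might look like. The file layout, argument names and the trivial JSON-to-Parquet pipeline are illustrative assumptions, not the template's actual code:

```python
# Minimal sketch of a PySpark ETL job; names and logic are hypothetical.
import argparse

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext


def parse_args():
    # Simple command-line arguments, passed to the job after spark-submit's own flags.
    parser = argparse.ArgumentParser(description="Example PySpark ETL job")
    parser.add_argument("--input-path", required=True, help="Source data location")
    parser.add_argument("--output-path", required=True, help="Destination for results")
    return parser.parse_args()


def main():
    args = parse_args()

    # Create the Spark context and SQL context the job will use.
    conf = SparkConf().setAppName("example-etl-job")
    sc = SparkContext(conf=conf)
    sql_context = SQLContext(sc)

    # Extract, transform, load: a trivial placeholder pipeline.
    df = sql_context.read.json(args.input_path)
    df.write.parquet(args.output_path)

    sc.stop()


if __name__ == "__main__":
    main()
```

You would then launch it with something like `spark-submit --py-files dependencies.zip main.py --input-path ... --output-path ...`, where dependencies.zip bundles your project source code and third-party requirements.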
So if you’re starting a new Spark project, “Fork” it on GitHub and enjoy Sparking it up!
Please feel free to share any thoughts, open issues and contribute code!