How to Deploy ETL and ML Pipelines in the Fastest, Cheapest and Most Flexible Way Possible
Data engineers, data scientists, and ML engineers often need to build pipelines that take a long time to run, are resource-hungry, require tricky scheduling and execution ordering, and frequently need different execution environments at different stages.
Typically, the solution is some form of pipeline-orchestration software such as Airflow or Prefect. Both are great options for production workloads, but they have some clear drawbacks:
- It is difficult to isolate nodes into completely separate environments.
- It is impossible to assign separate server instances to different stages of an ETL or training pipeline.
- The initial setup can be cumbersome: migrating a plain Python script to an Airflow DAG requires additional coding and configuration.
- It is not cloud-based, so there is little flexibility in resources; and when it is, as with Google's Cloud Composer, it requires additional configuration.
Generally, an ideal DataOps solution would meet the following requirements:
- The pipeline can be broken down into separate steps, each with its own dependencies, language version, and so on.
- Each step in the pipeline can be assigned to a separate, easily configurable server. For example, a data-processing step might need high memory and a powerful CPU, a model-training step might need a GPU, and a loading step might need high network bandwidth.
- Pipeline management, such as starting up and shutting down resources, has to be automatic, so that you incur costs only per usage.
- Scheduling has to be integrated and easily configurable.
- It must be able to express complex pipelines as DAGs.
- It has to be a low-code or, preferably, no-code solution, for fast adoption.
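To make the first and last points on this wish list concrete, the "pipeline as a DAG of steps" idea can be sketched in a few lines of pure Python. This is only an illustration, not the platform's actual API; all names (the `PIPELINE` dict, the `server` labels) are made up for the example:

```python
# Sketch: a pipeline expressed as a DAG of steps. Each step could, in a
# real platform, carry its own requirements.txt and server specification.
from graphlib import TopologicalSorter  # stdlib since Python 3.9

def extract():
    return [1, 2, 3]

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    return f"loaded {len(rows)} rows"

# step name -> function, upstream dependencies, hypothetical server spec
PIPELINE = {
    "extract":   {"fn": extract,   "deps": [],            "server": "high-memory"},
    "transform": {"fn": transform, "deps": ["extract"],   "server": "big-cpu"},
    "load":      {"fn": load,      "deps": ["transform"], "server": "high-bandwidth"},
}

def run(pipeline):
    """Execute steps in topological order, feeding each step its inputs."""
    order = TopologicalSorter({k: v["deps"] for k, v in pipeline.items()}).static_order()
    results = {}
    for name in order:
        step = pipeline[name]
        inputs = [results[d] for d in step["deps"]]
        results[name] = step["fn"](*inputs)
    return results

print(run(PIPELINE)["load"])  # -> loaded 3 rows
```

In an orchestrated setting, each entry in `PIPELINE` would run in its own environment on its own server; here everything runs in one process purely to show the DAG ordering.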
The good news is that I have built a solution that addresses all of the pains above. Here is how it works:
1. You write your ETL, ML, or DS code.
2. You zip your code and upload it to our platform along with all necessary artifacts, such as config files, models, and requirements.txt.
3. You choose which function to run.
4. You choose which server to run your application on: big memory, big CPU, GPU, etc.
5. You choose when to run your code: on demand or on a schedule.
6. You can add more pipeline steps by repeating steps 1 to 5, if you wish to create more complex pipelines.
7. Once you are done, you just press start and we build the entire pipeline for you.
8. The platform builds a separate environment with all the dependencies from requirements.txt for each step you configured.
9. The platform sets up a separate server, chosen by its characteristics, for each step.
10. Finally, the platform executes the run on demand or on a schedule, depending on what you have chosen, and, most importantly, starts and stops each server as its step begins and ends.
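The zip-and-upload step above amounts to bundling a step's code and its dependency manifest into a single archive. A minimal sketch of that packaging in pure Python (the directory layout and file names are illustrative, not a format the platform prescribes):

```python
# Sketch: bundle one pipeline step's code, config, and requirements.txt
# into a zip archive ready for upload. Paths here are illustrative.
import zipfile
from pathlib import Path

def package_step(src_dir: str, archive: str) -> str:
    """Zip every file under src_dir, preserving relative paths."""
    src = Path(src_dir)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in src.rglob("*"):
            if path.is_file():
                zf.write(path, path.relative_to(src))
    return archive

# Example layout of a step directory:
#   my_step/
#     main.py            <- contains the function the platform will run
#     requirements.txt   <- dependencies for this step's environment
```

Each step gets its own archive, which is what lets the platform build a separate environment per step.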
I built this platform as our company's internal MLOps solution and use it daily for ETL and ML pipelines that require long run times, resource-intensive servers, and flexible environments.
If you wish to deploy ETL or ML pipelines without the pain of configuring cloud solutions, or paying for costly cloud VMs that charge you even when you are not using them, let us know about your interest by filling out this form, and we will share our solution with you: bit.ly/beta-test-h