Databricks is emerging as one of the main players in the MLOps and DevOps world. Over the last month, I have experienced this first-hand as part of my project at Areto and decided to write a series of hands-on blog posts on how to implement an MLOps pipeline with Databricks. This is an ongoing learning process for me as well: I write each blog post after experimenting with different features and tools, so I like to see this series as collective learning. Any comments and suggestions are welcome.
In this blog post, we'll explore how to build a Continuous Integration (CI) pipeline using the Databricks DBX tool and GitLab. This guide is designed for data engineers and scientists who are looking to streamline their data processing workflows. We'll cover key concepts, practical applications, and step-by-step instructions to help you integrate these tools effectively into your data projects.
Continuous Integration and Continuous Delivery/Deployment (CI/CD) have become fundamental in modern software development practices. At its core, CI/CD is a method that emphasizes short, frequent updates to software through automated pipelines. While traditionally associated with software development, its principles are increasingly being adopted in data engineering and data science. The essence of CI/CD lies in its ability to automate the stages of software development, particularly in building, testing, and deploying code. Continuous Integration begins with regular code commits to a shared repository branch, ensuring collaborative development without version conflicts. Each commit undergoes an automated build and test process, validating changes to enhance the quality and reliability of the final product.
Databricks has emerged as a premier cloud-based platform, uniting the realms of data engineering, machine learning, and analytics. Databricks excels in handling large-scale data processing and complex analytical tasks, but to leverage its full potential, teams need to navigate its multifaceted environment effectively. Integrating CI/CD processes within Databricks environments can streamline workflows, ensuring consistent, reproducible, and scalable data operations. This integration is crucial for teams aiming to maintain agility and efficiency in their data projects, enabling them to deliver reliable, high-quality data solutions consistently.
Databricks workflows are particularly significant in the context of CI/CD, as they offer a robust platform for automating and orchestrating data pipelines crucial to continuous integration and deployment. These workflows allow for the seamless scheduling and execution of tasks, which is essential for maintaining the frequent, automated update cycles characteristic of CI/CD. In a CI/CD pipeline, Databricks workflows can be used to automatically process and test large datasets, ensuring that data transformations and analyses are consistently accurate and up-to-date. The diversity of tasks that can be executed within Databricks workflows, such as Notebooks for exploratory data analysis, Python scripts for structured data processing, and Python wheels for custom package deployment, aligns well with the varied needs of CI/CD pipelines. By integrating these tasks into CI/CD workflows, data teams can ensure that every aspect of their data processing and analysis is continuously tested and integrated into the larger data strategy. This integration is key for developing resilient, scalable, and efficient data operations, enabling teams to deliver high-quality, reliable data products rapidly.
The Databricks CLI Extension, or DBX, is a pivotal tool in integrating Databricks with CI/CD pipelines, enhancing the automation and management of data workflows. The ability to programmatically control and manipulate data processes is crucial to implementing a CI/CD workflow, and DBX fills this role effectively. It provides a command-line interface for interacting with various Databricks components, such as workspaces, workflows, and clusters, facilitating seamless integration into automated pipelines.
In our hands-on example, we will create a (very) minimal project that uses dbx to deploy and run a Databricks workflow to manipulate, analyze, and test some data. This practical exercise showcases how Databricks can be used for real-world data analysis tasks. Our CI pipeline will play a crucial role in this process, as it will automate the deployment of our code as a Databricks workflow and validate its output. This validation step is vital to ensure the accuracy and reliability of our analysis. By the end of this exercise, you'll have a clear understanding of how to set up and run a Databricks workflow within a CI pipeline and how such a setup can be beneficial in analyzing and deriving insights from large datasets. This example will not only demonstrate the technical application of Databricks and CI principles but also offer a glimpse into the practical benefits of automated data analysis in a business context.
There are various development patterns for implementing our CI pipeline. The common dbx workflow is to develop, test, and debug your code in your local environment and then use dbx to batch-run that local code on a target cluster. By integrating a remote repository with Databricks, we can then use a CI/CD platform such as GitHub Actions, Azure DevOps, or GitLab to automate running the remote repo's code on our clusters.
In this tutorial, for the sake of simplicity, we use the Databricks GUI to develop and test our code. We follow these steps:
Create a remote repository and clone it into our Databricks workspace. We use GitLab here.
Develop the program logic and test it inside the Databricks GUI. This includes the Python scripts for building a Python wheel package, the scripts for testing data quality with pytest, and a notebook to run pytest.
Push the code to GitLab. The git push will trigger a GitLab Runner to build, deploy, and launch our workflows on Databricks using dbx.
As the first step, we configure Git credentials and connect a remote repository to Databricks: we create the repository on GitLab and clone it into our Databricks workspace under Repos. To allow our GitLab Runner to communicate with the Databricks API through dbx, we must add two environment variables, DATABRICKS_HOST and DATABRICKS_TOKEN, to our CI/CD pipeline configuration.
To generate a Databricks token, go to User Settings → Developer → Access tokens → Manage → Generate new token in your Databricks workspace.
The Databricks host is the URL you see when you log in to your Databricks workspace. It looks something like https://dbc-dc87ke16-h5h2.cloud.databricks.com/ . The last part of the URL is your workspace ID; leave it out and use only the base host.
Finally, we add the token and host to our GitLab CI/CD settings. To do so, open your repo in GitLab and go to Settings → CI/CD → Variables → Add variable.
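If you want the pipeline to fail fast when these variables are missing, a small pre-flight script like the sketch below could be added as an extra step in the CI job. This is a hypothetical helper, not part of dbx or this repo; it only checks that the two variables dbx reads for authentication are visible to the job:

# check_env.py (hypothetical helper, not part of dbx)
import os

# dbx authenticates with these two variables when no profile is configured
missing = [name for name in ("DATABRICKS_HOST", "DATABRICKS_TOKEN") if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing CI/CD variables: {', '.join(missing)}")
print("Databricks credentials found; dbx can authenticate against the workspace.")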
The project is structured into several folders and key files, each serving a specific purpose:
.dbx folder: contains project.json, which defines the configuration of your DBX project. It holds environment settings, dependencies, and other project-specific parameters.
conf folder: contains deployment.yml, which outlines the deployment configurations and environment settings. It defines the workflows with their parameters and cluster configurations.
my_package folder: our wheel package. It includes a tasks subfolder with the main ETL task script sample_etl_job.py (the ETL task loads our dataset and creates two new tables) and a common.py file with common utilities that provide access to components such as the SparkSession.
notebooks folder: contains two notebooks: explorative_analysis, which plots the distribution of different features in our dataset, and run_unit_test, which executes pytest for unit testing.
tests folder: dedicated to testing. conftest.py includes the pytest fixtures, and test_data.py contains the unit tests that validate the data structure in our tables.
In addition to these folders, there are two important files in the root directory:
.gitlab-ci.yml: the configuration file for GitLab's Continuous Integration (CI) service, defining the instructions and commands the CI pipeline should execute.
setup.py: used for building our Python wheel package. It defines the package's metadata, dependencies, and build instructions.
dbx-tutorial/
├─ .dbx/
│ ├─ project.json
├─ conf/
│ ├─ deployment.yml
├─ my_package/
│ ├─ tasks/
│ │ ├─ __init__.py
│ │ ├─ sample_etl_job.py
│ ├─ __init__.py
│ ├─ common.py
├─ tests/
│ ├─ conftest.py
│ ├─ test_data.py
├─ notebooks/
│ ├─ explorative_analysis
│ ├─ run_unit_test
├─ .gitlab-ci.yml
├─ setup.py
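The package code itself is not reproduced in this post. As a rough sketch only (class names, the source table, and the output table names are assumptions, loosely modelled on the default dbx project template rather than the actual repo), common.py could expose a small Task base class that owns the SparkSession, and sample_etl_job.py could subclass it to create the two tables:

# my_package/common.py (sketch)
from abc import ABC, abstractmethod
from typing import Optional

from pyspark.sql import SparkSession


class Task(ABC):
    """Base class that gives every job access to a shared SparkSession."""

    def __init__(self, spark: Optional[SparkSession] = None):
        self.spark = spark or SparkSession.builder.getOrCreate()

    @abstractmethod
    def launch(self) -> None:
        """Run the job."""


# my_package/tasks/sample_etl_job.py (sketch)
class SampleETLJob(Task):
    """Loads the source dataset and materializes two derived tables."""

    def launch(self) -> None:
        # placeholder source; the real dataset is defined in the repo
        df = self.spark.read.table("samples.nyctaxi.trips")
        df.write.mode("overwrite").saveAsTable("raw_trips")
        (
            df.groupBy("pickup_zip")
            .count()
            .write.mode("overwrite")
            .saveAsTable("trips_per_zip")
        )


def entrypoint() -> None:
    # referenced from setup.py entry_points so the python_wheel_task can call it
    SampleETLJob().launch()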
The project.json file is crucial for defining your DBX project's configuration. It can be generated automatically when you run dbx init. Check the dbx documentation for more details on the project file reference.
The “profile” option is used for local development. If you run the dbx command inside the CI tool, you need to provide the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and they will override the profile setting. Since we develop and run our code in the Databricks UI rather than locally, we don't rely on this option here; dbx picks up the credentials from the CI/CD variables we set earlier.
The files that dbx uploads automatically are tracked as an MLflow experiment and stored under the artifact_location path. You can read more about this in the dbx documentation.
{
    "environments": {
        "default": {
            "profile": "dbx-tutorial",
            "storage_type": "mlflow",
            "properties": {
                "workspace_directory": "/Shared/dbx/dbx-tutorial",
                "artifact_location": "dbfs:/Shared/dbx/projects/dbx-tutorial"
            }
        }
    },
    "inplace_jinja_support": false,
    "failsafe_cluster_reuse_with_assets": false,
    "context_based_upload_for_execute": false
}
The deployment.yml file outlines the deployment configurations and environment settings. Here, we use the Jobs 2.1 API and the python_wheel_task format to define the workflows. Each workflow defines an object inside the Databricks workspace: a job or a Delta Live Tables pipeline.
custom:
  existing_cluster_id: &existing_cluster_id
    existing_cluster_id: "1064-xxxxxx-xxxxxxxxx"

environments:
  default: # this should be the same environment name as defined in project.json
    workflows:
      - name: "etl_job"
        tasks:
          - task_key: "main"
            <<: *existing_cluster_id
            python_wheel_task:
              package_name: "my_package"
              entry_point: "etl_job" # see the setup.py entry_points section for how to define an entry point
          - task_key: "eda"
            <<: *existing_cluster_id
            notebook_task:
              notebook_path: "/Repos/<your DB username>/dbx-tutorial/notebooks/explorative_analysis"
              source: WORKSPACE
            depends_on:
              - task_key: "main"
      - name: "test_job"
        tasks:
          - task_key: "main"
            <<: *existing_cluster_id
            notebook_task:
              notebook_path: "/Repos/<your DB username>/dbx-tutorial/notebooks/run_unit_test"
              source: WORKSPACE
            libraries:
              - pypi:
                  package: pytest
Here, we define two workflows:
etl_job: consists of two tasks. The first task (main) is of type python_wheel_task; it runs the ETL job using the entry point that we define in the setup.py file. The second task (eda) is of type notebook_task; it runs the explorative_analysis notebook after the main task has finished successfully. Notice the use of the depends_on property.
test_job: consists of a single task of type notebook_task. It runs the notebook that is responsible for executing pytest.
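The entry_point: "etl_job" value in the deployment file has to match an entry point exposed by the wheel. Since setup.py is not shown in full in this post, here is a hedged sketch of what it could look like; the module path and the "local" extra are assumptions based on the project layout and the pip install -e ".[local]" command used later in the CI pipeline:

# setup.py (sketch)
from setuptools import find_packages, setup

setup(
    name="my_package",
    version="0.1.0",
    packages=find_packages(exclude=["tests", "tests.*"]),
    install_requires=["pyspark"],  # assumed runtime dependency
    extras_require={
        # the ".[local]" extra installed in .gitlab-ci.yml
        "local": ["dbx", "pytest"],
    },
    entry_points={
        "console_scripts": [
            # maps the wheel task's entry_point "etl_job" to a callable
            "etl_job = my_package.tasks.sample_etl_job:entrypoint",
        ]
    },
)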
Note: In the above example, we run the workflows on an existing cluster. You can also create a new cluster for every deployment and launch of your workflows. For that, you would change the configuration as follows:
custom:
  basic-cluster-props: &basic-cluster-props
    spark_version: "11.3.x-cpu-ml-scala2.12"

  basic-static-cluster: &basic-static-cluster
    new_cluster:
      <<: *basic-cluster-props
      num_workers: 1
      node_type_id: "Standard_E8_v3"

environments:
  default:
    workflows:
      - name: "etl_job"
        tasks:
          - task_key: "main"
            <<: *basic-static-cluster
            ....
Our GitLab CI pipeline is structured to automate the testing and deployment processes of the Databricks project. It consists of two main stages: test and deploy. In the test stage, the unit-test-job deploys and runs a separate workflow that executes the unit tests. The deploy stage, triggered upon successful completion of the test stage, handles the deployment of the main ETL workflow. In general, the pipeline follows these steps:
Build the project
Push the build artifacts to the Databricks workspace
Install the wheel package on your cluster
Create the jobs on Databricks Workflows
Run the jobs
image: python:3.9

stages: # List of stages for jobs, and their order of execution
  - test
  - deploy

unit-test-job: # This job runs in the test stage.
  stage: test # It only starts when the job in the build stage completes successfully.
  script:
    - echo "Running unit tests... This will take about 60 seconds."
    - echo "Code coverage is 90%"
    - pip install -e ".[local]"
    - dbx deploy --deployment-file conf/deployment.yml test_job --assets-only
    - dbx launch test_job --from-assets --trace

deploy-job: # This job runs in the deploy stage.
  stage: deploy # It only runs when *both* jobs in the test stage complete successfully.
  script:
    - echo "Deploying application..."
    - echo "Install dependencies"
    - pip install -e ".[local]"
    - echo "Deploying Job"
    - dbx deploy --deployment-file conf/deployment.yml etl_job
    - dbx launch etl_job --trace
    - echo "Application successfully deployed."
    - echo "remove all workflows."
For workflow deployment and launch, we use two commands:
dbx deploy: deploys the workflow definitions to the given environment.
dbx launch: launches the given workflow by its name in a given environment.
In the unit-test-job we use the --assets-only flag to avoid creating a job definition in Databricks Workflows during deployment. For the launch command, we then have to use the --from-assets flag; when it is provided, the launch command searches for the latest assets-only deployment. When launching this way, nothing appears in the Workflows UI; instead, an untitled job is started through the Runs Submit API, and you can see it on the Job Runs page of your Databricks workspace. Check out the dbx documentation to read about asset-based workflows and the other commands.
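The test_job that we deploy and launch this way only runs the run_unit_test notebook, which in turn invokes pytest against the tests folder of the repo. Since the notebook and test contents are not reproduced in this post, here is a rough sketch of how they might look (the relative path, fixture, and table name are assumptions):

# notebooks/run_unit_test (sketch) - run as a notebook cell on the cluster
import sys

import pytest

# avoid writing .pyc files into the repo checkout
sys.dont_write_bytecode = True

# run the tests that live in the repo next to this notebook; fail the task on any test failure
retcode = pytest.main(["../tests", "-v", "-p", "no:cacheprovider"])
assert retcode == 0, "Some unit tests failed"


# tests/conftest.py (sketch)
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # on a Databricks cluster this simply returns the already-running session
    return SparkSession.builder.getOrCreate()


# tests/test_data.py (sketch) - the table name is an assumption
def test_raw_trips_table_is_not_empty(spark):
    assert spark.table("raw_trips").count() > 0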
If you check the “Experiments” page of your Databricks workspace, you will find the build artifacts of each run of your CI pipeline.
Currently, Databricks recommends using Asset Bundles for CI/CD and offers a migration guide from dbx to bundles. If you understand the concepts in this post, switching to bundles will be easy. In the next post, I will explain how to convert this project into an asset bundle project.