In the first two parts of this series (part 1, part 2), we covered the following:

- **Setting Up the Unity Catalog for Medallion Architecture**: We organized our data into bronze, silver, and gold layers within the Unity Catalog, establishing a structured and efficient data management system.
- **Ingesting Data into Unity Catalog**: We demonstrated how to import raw data into the system, ensuring consistency and quality for subsequent processing stages.
- **Training the Model**: Utilizing Databricks, we trained a machine learning model tailored to our dataset, following best practices for scalable and effective model development.
- **Hyperparameter Tuning with HyperOpt**: To enhance model performance, we employed HyperOpt to automate the search for optimal hyperparameters, improving accuracy and efficiency.
- **Experiment Tracking with Databricks MLflow**: We utilized MLflow to log and monitor our experiments, maintaining a comprehensive record of model versions, metrics, and parameters for easy comparison and reproducibility.
- **Batch Inference**: We implemented batch processing to generate predictions on large datasets, suitable for applications like bulk scoring and periodic reporting.
- **Online Inference (Model Serving)**: We set up real-time model serving to provide immediate predictions, essential for interactive applications and services.
- **Model Monitoring**: We added monitoring to ensure deployed models maintain optimal performance and reliability over time.

In this last part, we'll see how to automate the whole process using GitLab, Databricks Asset Bundles, and Databricks jobs. Let's dive in!

## Orchestration

Databricks offers various tools for programmatically automating the management of jobs and workflows. In particular, they simplify the development, deployment, and launch of Databricks workflows across multiple environments.
In principle, all these tools are built around the Databricks REST API, which allows us to manage and control Databricks resources such as clusters, workspaces, workflows, and machine learning experiments and models. They are designed to be used both inside CI/CD pipelines and as part of local tooling for rapid prototyping.

Here is a list of some of these tools:

- **Databricks CLI eXtensions, aka dbx (Legacy)**: One of the first generations of such tools. You can read my blog post on how to use dbx to build a CI pipeline. Databricks is no longer actively developing this tool and recommends using either of the tools below.
- **Databricks Asset Bundles, aka DAB**: The recommended Databricks deployment framework for streamlining the development of complex data, analytics, and ML projects for the Databricks platform. Developers use YAML syntax to declare resources and configurations, and DAB lets them modularize the source files and metadata used to provision and manage infrastructure and other resources.
- **Databricks SDK for Python (Beta)**: The latest Databricks tool (currently in Beta as of version 0.53.0). With it, Databricks aims to reduce the overhead of managing resources for Python developers: the SDK lets developers set up, configure, and build resources using Python scripts.

In this tutorial, we will focus on Databricks Asset Bundles.

## CI/CD pipeline using Asset Bundles

In my previous blog, I described how to build a simple CI pipeline using GitLab and Databricks. I suggest going through that post before you continue reading; there I described the structure and different components of such projects in more detail. Here, I'd like to expand on that piece by adding the following:

- a multi-environment setup (dev, staging, prod)
- a CD pipeline

As mentioned at the beginning of this blog series, I'll try to follow the proposed Databricks reference architecture as closely as possible. I'll implement a three-stage deployment architecture and follow the deploy-code model deployment pattern instead of deploy-models, as described here.

Databricks recommends using separate environments or workspaces for each stage.
However, to keep resource usage minimal for this tutorial, I use a single workspace for all stages. To separate the data and artifacts of each stage, we set up a unique catalog per environment within the workspace, and we store the bundle files for each stage in separate folders. I'll also show you how to adapt the code here to set up separate workspaces instead.

Here is a summary of what happens in each stage:

- **Development**: This is the experimentation stage, where I develop and test the code for data ingestion and transformation, feature engineering, model training, model optimization, and deployment. Basically, all the code we have seen so far was developed in this stage.
- **Staging**: In practice, this is the stage where we test our pipelines to make sure they are ready for production. This is also where we run our unit, integration, and other tests. What is important here is that the staging environment should match the production environment as closely as is reasonable. To keep things simple, we don't spend any time on developing tests; we use a few simple test cases just to demonstrate the git workflow. I might devote a blog post to this topic in the future.
- **Production**: This is the final station. The code and artifacts in this stage are used for real-world scenarios, for example, showing recommendations to users and capturing their interactions with your applications.

## OAuth M2M Authentication (Service Principal)

In part 2 (linked above), I showed you how to use Databricks personal access token authentication. However, the recommended authentication method for unattended scenarios such as CI/CD or Airflow data pipelines is a service principal. In this tutorial, I use a service principal, but only for the production environment.

To read more about the advantages of service principals and how to create one in your workspace, read this page. Some important notes:

- To create a service principal, you must be an account admin.
- Service principals are always referenced by an application ID, which can be retrieved from the Service principals page in your workspace admin settings.
- For users or groups that need to use the service principal, the "Service principal: User" role should be granted explicitly.
In this case, I also have the Manager role. You can add a user by clicking the **Grant access** button.

## Configure DAB

The first step is to set up the bundle files. The `databricks.yml` file is the heart of our project: this is where we define all our workflows and relevant resources. I've already explained the different components of this file in the previous blog. This time we will have:

- three stages instead of one
- different configurations for the staging and production environments than for development
- a service principal instead of a personal token for authentication in production

We'll also look at how to modularize our bundle by separating it into different files. For this tutorial, I've divided the configuration into three modules:

- `databricks.yml`: the parent configuration file that defines the general structure of our bundle
- `resources.yml`: defines the default workflows and jobs
- `target.yml`: target-specific workflows, jobs, and configuration

Let's start with the highest-level configuration file of our bundle, the `databricks.yml` file:

```yaml
# yaml-language-server: $schema=bundle_config_schema.json
bundle:
  name: DAB_tutorialbnu

variables:
  my_cluster_id:
    description: The ID of an existing cluster.
    default: <cluster_id>
  service_principal_name:
    description: Service Principal Name for the production environment.
    default: 00-00 # use some random value as a fill-in

workspace:
  profile: asset-bundle-tutorial
  # host:

include:
  - ./bundle/*.yml
```

On the first line, we see `yaml-language-server: $schema=bundle_config_schema.json`. Databricks Asset Bundle configuration uses a JSON schema to make sure our config file has the right format. You can generate this file with Databricks CLI version 0.205 or above:

```bash
databricks bundle schema > bundle_config_schema.json
```

In the `variables` mapping, we define two variables:

- the ID of an **existing cluster** in our workspace that we'd like to use for running our jobs in the development environment
- the name of our **service principal** for running jobs in the production environment. The reason for defining this as a variable is that we want to set it **at runtime**, when we deploy and run our jobs in the production environment. The value given here is just a placeholder; don't use the actual application ID!
In the `workspace` mapping, we define our default workspace configuration, such as the host address or the profile name. Since we only have one workspace, we can use the same information for all the stages of our pipeline.

Finally, in the `include` mapping, we import the different modules of our bundle configuration into the `databricks.yml` file. I put all these modules into a folder named `bundle`, which includes two files.

## Define workflows

We define all our workflows and jobs in the `bundle/resources.yml` file. We define four workflows:

- environment initialization workflow (`init_workflow`): creates all the necessary catalogs, schemas, tables, and volumes
- data ingestion workflow (`ingestion_workflow`): ingests data from the sources, performs the necessary formatting and transformation, and writes it to the right catalog and schema
- model training workflow (`training_workflow`): trains, optimizes, deploys, and monitors the model
- testing workflow (`unity_test_workflow`): performs the unit and integration tests

Take a look at this file:

```yaml
resources:
  jobs:
    init_workflow:
      name: "[${bundle.target}]-init"
      tasks:
        - task_key: setup-env
          notebook_task:
            notebook_path: ../notebooks/1_initiate_env.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    ingestion_workflow:
      name: "[${bundle.target}]-ingestion"
      tasks:
        - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/2_ingest.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}
        - task_key: feature-store
          depends_on:
            - task_key: ingest
          notebook_task:
            notebook_path: ../notebooks/3_transform.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    training_workflow:
      name: "[${bundle.target}]-training"
      tasks:
        - task_key: training
          notebook_task:
            notebook_path: ../notebooks/4_training.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}
          libraries:
            - pypi:
                package: "mlflow-skinny[databricks]"
        - task_key: batch_inference
          depends_on:
            - task_key: training
          notebook_task:
            notebook_path: ../notebooks/6_batch_inference.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}
        - task_key: deployment
          depends_on:
            - task_key: training
          notebook_task:
            notebook_path: ../notebooks/5_deployment.py
            source: WORKSPACE
            base_parameters:
              env: ${bundle.target}
          existing_cluster_id: ${var.my_cluster_id}

    unity_test_workflow:
      name: "[${bundle.target}]-unity_test"
      tasks:
        - task_key: unity_test
          existing_cluster_id: ${var.my_cluster_id}
          notebook_task:
            notebook_path: ../notebooks/run_unit_test.py
            source: WORKSPACE
          libraries:
            - pypi:
                package: pytest
```
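As a side note: these jobs only run when we trigger them (later via `databricks bundle run`). If you eventually want one of them to run on a schedule, you can add a `schedule` block to the job definition. Below is a minimal sketch, assuming the standard Jobs API cron fields; the cron expression and the choice of workflow are made up for illustration and are not part of this tutorial's bundle:

```yaml
resources:
  jobs:
    ingestion_workflow:
      # hypothetical addition: run the ingestion workflow every day at 06:00 UTC
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "UTC"
        pause_status: UNPAUSED
```

Keep in mind that, as the comment on the `dev` target later in `target.yml` notes, deploying in development mode pauses schedules and triggers, so such a schedule would only be active for staging or production deployments.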
"[${bundle.target}]-init" tasks: - task_key: setup-env notebook_task: notebook_path: ../notebooks/1_initiate_env.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} ingestion_workflow: name: "[${bundle.target}]-ingestion" tasks: - task_key: ingest notebook_task: notebook_path: ../notebooks/2_ingest.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} - task_key: feature-store depends_on: - task_key: ingest notebook_task: notebook_path: ../notebooks/3_transform.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} training_workflow: name: "[${bundle.target}]-training" tasks: - task_key: training notebook_task: notebook_path: ../notebooks/4_training.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} libraries: - pypi: package: pip install --upgrade "mlflow-skinny[databricks]" - task_key: batch_inference depends_on: - task_key: training notebook_task: notebook_path: ../notebooks/6_batch_inference.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} - task_key: deployment depends_on: - task_key: training notebook_task: notebook_path: ../notebooks/5_deployment.py source: WORKSPACE base_parameters: env: ${bundle.target} existing_cluster_id: ${var.my_cluster_id} unity_test_workflow: name: "[${bundle.target}]-unity_test" tasks: - task_key: unity_test existing_cluster_id: ${var.my_cluster_id} notebook_task: notebook_path: ../notebooks/run_unit_test.py source: WORKSPACE libraries: - pypi: package: pytest NOTE: Make sure the notebook files are specified with .py file extension, otherwise you'll get a "no such resource" error notice. NOTE Pass Runtime Context Variables to Notebooks Most of the definitions in the code above are similar to what we saw in the previous blog. The main difference is the use of the base_parameters field for the notebook tasks. This allows us to pass context about a task to the notebook. In our case, we use it to pass the environment name to each notebook so it can apply the correct settings when running the code. base_parameters For example, in the 1.init_env.py notebook, we create and use a catalog as follows: 1.init_env.py import json with open('config.json') as config_file: config = json.load(config_file) catalog_name = config['catalog_name'] spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}") spark.sql(f"USE CATALOG {catalog_name}") #-- create all the neccessary schemas within our catalog spark.sql(f"CREATE SCHEMA IF NOT EXISTS {boronze_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {silver_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {gold_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {output_schema}") import json with open('config.json') as config_file: config = json.load(config_file) catalog_name = config['catalog_name'] spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}") spark.sql(f"USE CATALOG {catalog_name}") #-- create all the neccessary schemas within our catalog spark.sql(f"CREATE SCHEMA IF NOT EXISTS {boronze_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {silver_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {gold_layer}") spark.sql(f"CREATE SCHEMA IF NOT EXISTS {output_schema}") What we need, now, is to set the catalog name based on the stage that is currently running the code. For this, we need to set this value at runtime, when we deploy and run our bundle. 
This communication happens through the `base_parameters` field in our bundle file and notebook widgets. In this case, we pass `${bundle.target}`. This is how the new code looks:

```python
dbutils.widgets.text(name="env", defaultValue="staging", label="Environment Name")
env = dbutils.widgets.get("env")

#...
catalog_name = f"{config['catalog_name']}_{env}"
#...
```

To see what other context information you can send to your notebook, check out Substitutions in bundle configuration.

## Environment-specific Configuration

In the CI/CD process, our production and staging environments use different settings and resources for running the jobs. For example, we use larger datasets to train our model, or clusters with more resources to serve the users. Asset bundles allow us to partially overwrite our job and workflow definitions to fit the needs of a specific target/stage. That is, we can change certain parameters of our default workflows in `resources.yml` for each target. Let's see how this works by looking at the `target.yml` file:

```yaml
new_cluster: &new_cluster
  new_cluster:
    num_workers: 3
    spark_version: 13.3.x-cpu-ml-scala2.12
    node_type_id: i3.xlarge
    autoscale:
      min_workers: 1
      max_workers: 3
    custom_tags:
      clusterSource: prod_13.3

targets:
  # The 'dev' target, used for development purposes.
  # Whenever a developer deploys using 'dev', they get their own copy.
  dev:
    # We use 'mode: development' to make sure everything deployed to this target gets a prefix
    # like '[dev my_user_name]'. Setting this mode also disables any schedules and
    # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
    mode: development
    default: true

  staging:
    workspace:
      host: <host address of the staging workspace>
      root_path: /Shared/staging-workspace/.bundle/${bundle.name}/${bundle.target}
    resources:
      jobs:
        playground_workflow:
          name: ${bundle.target}-${var.model_name}
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: playground
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  workload_size: Medium
                  scale_to_zero_enabled: "False"
        ingestion_workflow:
          name: "[${bundle.target}]-ingestion"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: ingest
              job_cluster_key: model_training_job_cluster
            - task_key: feature-store
              job_cluster_key: model_training_job_cluster
        training_workflow:
          name: "[${bundle.target}]-training"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: training
              job_cluster_key: model_training_job_cluster
            - task_key: batch_inference
              job_cluster_key: model_training_job_cluster
            - task_key: deployment
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  env: ${bundle.target}
                  workload_size: Medium
                  scale_to_zero_enabled: False
              depends_on:
                - task_key: training

  production:
    mode: production
    workspace:
      host: <host address of the production workspace>
      root_path: /Shared/production-workspace/.bundle/${bundle.name}/${bundle.target}
    variables:
      service_principal_name:
        description: Service Principal Name for the production environment.
    run_as:
      service_principal_name: ${var.service_principal_name}
    resources:
      jobs:
        ingestion_workflow:
          name: "[${bundle.target}]-ingestion"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: ingest
              job_cluster_key: model_training_job_cluster
            - task_key: feature-store
              job_cluster_key: model_training_job_cluster
        training_workflow:
          name: "[${bundle.target}]-training"
          job_clusters:
            - job_cluster_key: model_training_job_cluster
              <<: *new_cluster
          tasks:
            - task_key: training
              job_cluster_key: model_training_job_cluster
            - task_key: batch_inference
              job_cluster_key: model_training_job_cluster
            - task_key: deployment
              job_cluster_key: model_training_job_cluster
              notebook_task:
                base_parameters:
                  env: ${bundle.target}
                  workload_size: Medium
                  scale_to_zero_enabled: False
              depends_on:
                - task_key: training
```
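Before walking through the changes, here is a distilled, hypothetical sketch of the override mechanism these files rely on. The job name `example_job`, the task key `step_one`, the cluster key `example_cluster`, the anchor name, and the notebook path are made up for illustration; only the pattern mirrors the files above:

```yaml
# reusable cluster spec, referenced below through a YAML anchor
example_cluster_spec: &example_cluster_spec
  new_cluster:
    num_workers: 2
    spark_version: 13.3.x-cpu-ml-scala2.12
    node_type_id: i3.xlarge

# top-level (default) definition, as in resources.yml
resources:
  jobs:
    example_job:
      name: "[${bundle.target}]-example"
      tasks:
        - task_key: step_one
          existing_cluster_id: ${var.my_cluster_id}
          notebook_task:
            notebook_path: ../notebooks/example.py
            source: WORKSPACE

# target override, as in target.yml: joined with the definition above
# by the job name and by task_key; target-level settings take precedence
targets:
  staging:
    resources:
      jobs:
        example_job:
          job_clusters:
            - job_cluster_key: example_cluster
              <<: *example_cluster_spec
          tasks:
            - task_key: step_one
              job_cluster_key: example_cluster
```

This is the same pattern the staging and production targets above use to swap the development cluster for a job cluster and to adjust notebook parameters.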
Here we make three changes for the staging and production stages:

- Instead of using the existing cluster that we used for development, we want to run our jobs on a larger job compute with autoscaling. For this, we define a new cluster and use the `&new_cluster` anchor to refer to it in each job.
- Similarly, we want to deploy our model serving endpoint on a larger compute instance. For this, we update the `base_parameters` in the `deployment` task.
- We specify `root_path` under the `workspace` mapping to store the artifacts and files of each stage in a different folder in our workspace.

You can see that we don't need to rewrite all the job tasks for each target again, but only 1) the job definition and task mapping, and 2) the parameters that we wish to change or add. Databricks uses the job definition to join the job task settings in the top-level `resources` mapping with the job task settings in the `targets` mapping. More about this in this Databricks article.

Additionally, we use Databricks Asset Bundle deployment modes for the development and production environments. They provide an optional collection of default behaviors that correspond to each of these modes.

Finally, in the production environment, we use the `run_as` mapping to specify the identity to use when running Databricks Asset Bundles workflows. In this case, we set its value to the variable that we defined earlier in our `databricks.yml` file.

**NOTE**: If you want to use different workspaces for each environment, change the host URL under the `workspace` mapping!

## Git workflow

Defining the right git strategy depends on many factors, such as team size and composition, project size, and deployment life cycle. Here we follow the Databricks standard workflow as described here.
- **Development**: ML code is developed in the development environment, with code pushed to a dev (or feature) branch.
- **Testing**: Upon making a pull request from the dev branch to the main branch, a CI trigger runs unit tests on the CI runner and integration tests in the staging environment.
- **Merge code**: After successfully passing these tests, changes are merged from the dev branch to the main branch.
- **Release code**: The release branch is cut from the main branch, and doing so deploys the project ML pipelines to the production environment.

### Setup the Repo

To set up the repo, we create three branches: `main`, `dev`, and `release`. `main` will be our default branch. Make sure that the `main` and `release` branches are protected, that is, you can't update them directly but only through merge requests. Then go ahead and clone the repo into your Databricks workspace. I assume you've already integrated your Git provider with your Databricks workspace.

Next, we add the token, host, and service principal application ID to our GitLab CI/CD settings (they are referenced in the pipeline below as `DATABRICKS_TOKEN`, `DATABRICKS_HOST`, and `var_spn_prod`). For that, open your repo in GitLab and go to Settings → CI/CD → Variables → Add variable.

### Define the Pipeline

Now we need to define our CI/CD pipeline in `.gitlab-ci.yml`. We define two stages:

- `onMerge`: triggered when we merge the dev or feature branch into the main branch. It runs our `unity-test` and `integration-test` jobs.
- `onRelease`: triggered when we push or merge the changes from the main branch to the release branch.

```yaml
image: python:3.10

stages: # List of stages for jobs, and their order of execution
  - onMerge
  - onRelease

default:
  before_script:
    - echo "install databricks cli"
    - curl -V
    - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
    - echo "databricks CLI installation finished"
    - echo "Creating the configuration profile for token authentication..."
    - |
      {
        echo "[asset-bundle-tutorial]"
        echo "token = $DATABRICKS_TOKEN"
        echo "host = $DATABRICKS_HOST"
      } > ~/.databrickscfg
    - echo "validate the bundle"
    - databricks bundle validate

  after_script:
    - echo "remove all workflows"
    #- databricks bundle destroy --auto-approve

unity-test:
  stage: onMerge
  script:
    - echo "--- Running the unit tests"
    - databricks bundle deploy -t dev
    - databricks bundle run -t dev unity_test_workflow
    - databricks bundle destroy --auto-approve -t dev
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'

integration-test:
  stage: onMerge
  needs:
    - unity-test
  script:
    - echo " --- Running the <minimal> integration tests on the staging env"
    # - echo "validate bundle staging"
    - databricks bundle validate -t staging
    - databricks bundle deploy -t staging
    - databricks bundle run -t staging init_workflow
    - databricks bundle run -t staging ingestion_workflow
    - databricks bundle run -t staging training_workflow
    #- databricks bundle destroy --auto-approve -t staging
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "main"'

deploy_for_prod: # This job runs in the onRelease stage.
  stage: onRelease
  script:
    - echo "validate bundle production"
    - databricks bundle validate --var="service_principal_name=$var_spn_prod" -t production
    - echo "Deploying jobs"
    - databricks bundle deploy --var="service_principal_name=$var_spn_prod" -t production
    # - databricks bundle run -t prod ingestion_workflow
    # - databricks bundle run -t prod training_workflow
    - echo "Application successfully deployed for production"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" && $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "release" || $CI_MERGE_REQUEST_TARGET_BRANCH_NAME == "testi"'
```
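The pipeline stops at deploying the production bundle. If you later want it to also cut a tag or a GitLab release whenever the release branch is updated (an extension I mention below but skip), a hypothetical extra job could look roughly like the sketch below. The job name, the tag naming scheme, and the trigger rule are my own assumptions, not part of this tutorial's pipeline:

```yaml
create_release:
  stage: onRelease
  image: registry.gitlab.com/gitlab-org/release-cli:latest
  before_script: []   # skip the default before_script; the Databricks CLI is not needed here
  rules:
    - if: '$CI_COMMIT_BRANCH == "release"'
  script:
    - echo "Creating a release for commit $CI_COMMIT_SHORT_SHA"
  release:
    tag_name: "v0.$CI_PIPELINE_IID"   # hypothetical versioning scheme
    description: "Automated release created by pipeline $CI_PIPELINE_ID"
```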
Our CI/CD workflow consists of manual and automated steps:

1. push the changes to the dev branch
2. create a merge request to merge the changes from dev into the main branch (manual)
3. run the unit and integration tests (automatic)
4. merge the changes once the whole testing pipeline has succeeded (manual)
5. create a merge request from the main branch to the release branch (manual)
6. deploy the jobs to the Databricks production environment (automatic)

It is possible to automate the whole process as part of our CI/CD pipeline, and we could also add new stages/steps such as creating tags or releases (see the sketch above). But for now, we skip this 😉

After running the pipeline, you should see the following workflows in your Databricks Workflows window. We don't see the dev workflows because we run `databricks bundle destroy -t dev` as part of the `unity-test` job.

If you check your Workspace → Shared folder, you'll find two separate folders for your staging and production bundle files. You can find the bundle files for your dev environment under `Workspace/Users/<username>/.bundle/<bundle name>`. Similarly, you'll find different experiment names for each environment in your Databricks Experiments window:

```yaml
default: /Users/${workspace.current_user.userName}/${bundle.target}-my_mlops_project-experiment
```

Two notes about passing the `service_principal_name`:

- There are different options for setting the `service_principal_name` in the CI/CD pipeline. Here we use the `--var` option as part of our `bundle` commands.
- The variable should be defined at the top level and then overwritten for a specific target. If we define a variable only in a target, we get the error `Error: variable service_principal_name is not defined but is assigned a value`.
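To make the second note concrete, here is a minimal, hypothetical sketch (not taken from the project files) of the two pieces that have to exist for the variable to work:

```yaml
# databricks.yml (top level): the variable must be declared here first
variables:
  service_principal_name:
    description: Service Principal Name for the production environment.
    default: 00-00   # placeholder only; the real value is passed with --var at deploy time

# bundle/target.yml: the production target can then reference or override it,
# for example through run_as:
#
#   production:
#     run_as:
#       service_principal_name: ${var.service_principal_name}
#
# Skipping the top-level declaration and introducing the variable only inside the
# target is what produces the error quoted above.
```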
## Databricks MLOps Stacks

Databricks provides Databricks MLOps Stacks to reduce the overhead of setting up everything from scratch, as we did here. It's a great tool that gives you a head start on setting up your ML projects, and I adapted part of this tutorial from their template. However, building things up from scratch always helps me understand the details and the thinking behind the processes and tools. One thing that I did differently from MLOps Stacks is how I configure and define my ML model and experiments through the bundle. But I think if you understand the principles and thinking behind asset bundles, you can easily adapt it for your project. Make sure you check it out and adapt it to your needs.

That's it! I hope this blog series helps you build some great things. Any feedback you have for me would be much appreciated! And as always: happy building :)