Building CI Pipeline with Databricks Asset Bundle and GitLab

Introduction

In the previous blog, I showed you how to build a CI pipeline using Databricks CLI eXtensions (dbx) and GitLab. In this post, I will show you how to achieve the same objective with the latest and recommended Databricks deployment framework, Databricks Asset Bundles (DAB). DAB is actively supported and developed by the Databricks team as a new tool for streamlining the development of complex data, analytics, and ML projects for the Databricks platform.

I will skip the general introduction of DAB and its features and refer you to the Databricks documentation. Here, I will focus on how to migrate our dbx project from the previous blog to DAB. Along the way, I will explain some concepts and features that can help you grasp each step better.

Development pattern using Databricks GUI

In the previous post, we used the Databricks GUI to develop and test our code and workflows. For this blog post, we want to be able to use our local environment to develop our code as well. The workflow will be as follows:

1. Create a remote repository and clone it to our local environment and Databricks workspace. We use GitLab here.
2. Develop the program logic and test it inside the Databricks GUI or in our local IDE. This includes Python scripts to build a Python wheel package, scripts to test data quality using pytest, and a notebook to run pytest.
3. Push the code to GitLab. The git push will trigger a GitLab Runner to build, deploy, and launch resources on Databricks using Databricks Asset Bundles.

Setting up your development environments

Databricks CLI

First of all, we need to install Databricks CLI version 0.205 or above on our local machine. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI.

Authentication

Databricks supports various authentication methods between the Databricks CLI on our development machine and our Databricks workspace.
For this tutorial, we use Databricks personal access token authentication. It consists of two steps:

1. Create a personal access token in our Databricks workspace.
2. Create a Databricks configuration profile on our local machine.

To generate a Databricks token in your Databricks workspace, go to User Settings → Developer → Access tokens → Manage → Generate new token.

To create a configuration profile, create the file ~/.databrickscfg in your home folder with the following content:

    [asset-bundle-tutorial]
    host = https://xxxxxxxxxxx.cloud.databricks.com
    token = xxxxxxx

Here, asset-bundle-tutorial is our profile name, host is the address of our workspace, and token is the personal access token that we just created.

You can also create this file using the Databricks CLI by running databricks configure --profile asset-bundle-tutorial in your terminal. The command will prompt you for the Databricks Host and Personal Access Token. If you don't specify the --profile flag, the profile name will be set to DEFAULT.

Git integration (Databricks)

As the first step, we configure Git credentials & connect a remote repo to Databricks. Next, we create a remote repository and clone it to our Databricks repo, as well as to our local machine. Finally, we need to set up authentication between the Databricks CLI on the GitLab runner and our Databricks workspace. To do that, we add two environment variables, DATABRICKS_HOST and DATABRICKS_TOKEN, to our GitLab CI/CD pipeline configuration. Open your repo in GitLab and go to Settings → CI/CD → Variables → Add variables.

Both dbx and DAB are built around the Databricks REST APIs, so at their core, they are very similar. I will go through the steps to create a bundle manually from our existing dbx project.

The first thing that we need to set up for our DAB project is the deployment configuration. In dbx, we used two files to define and set up our environments and workflows (jobs and pipelines).
To set up the environment, we used .dbx/project.json, and to define the workflows, we used deployment.yml.

In DAB, everything goes into databricks.yml, which is located in the root folder of your project. Here's how it looks:

    bundle:
      name: DAB_tutorial  # our bundle name

    # These are for any custom variables for use throughout the bundle.
    variables:
      my_cluster_id:
        description: The ID of an existing cluster.
        default: xxxx-xxxxx-xxxxxxxx

    # The remote workspace URL and workspace authentication credentials are read
    # from the caller's local configuration profile named below.
    workspace:
      profile: asset-bundle-tutorial

    # These are the default job and pipeline settings if not otherwise overridden in
    # the following "targets" top-level mapping.
    # Paths are relative to the file that declares them. Here that is databricks.yml
    # in the project root; inside resources/dev_jobs.yml they would start with ../
    resources:
      jobs:
        etl_job:
          tasks:
            - task_key: "main"
              existing_cluster_id: ${var.my_cluster_id}
              python_wheel_task:
                package_name: "my_package"
                # Take a look at the entry_points section of setup.py for details
                # on how to define an entry point.
                entry_point: "etl_job"
              libraries:
                - whl: ./dist/*.whl
            - task_key: "eda"
              existing_cluster_id: ${var.my_cluster_id}
              notebook_task:
                notebook_path: ./notebooks/explorative_analysis.py
                source: WORKSPACE
              depends_on:
                - task_key: "main"
        test_job:
          tasks:
            - task_key: "main_notebook"
              existing_cluster_id: ${var.my_cluster_id}
              notebook_task:
                notebook_path: ./notebooks/run_unit_test.py
                source: WORKSPACE
              libraries:
                - pypi:
                    package: pytest

    # These are the targets to use for deployments and workflow runs. One and only one of these
    # targets can be set to "default: true".
    targets:
      # The 'dev' target, used for development purposes.
      # Whenever a developer deploys using 'dev', they get their own copy.
      dev:
        # We use 'mode: development' to make sure everything deployed to this target gets a prefix
        # like '[dev my_user_name]'. Setting this mode also disables any schedules and
        # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
        mode: development
        default: true
        workspace:
          profile: asset-bundle-tutorial
          root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/my-envs/${bundle.target}
          host:

The databricks.yml bundle configuration file consists of sections called mappings. These mappings allow us to modularize the configuration file into separate logical blocks. There are 8 top-level mappings:

- bundle
- variables
- workspace
- artifacts
- include
- resources
- sync
- targets

Here, we use five of these mappings to organize our project.

bundle: In the bundle mapping, we define the name of the bundle. Here, we can also define a default cluster ID that should be used for our development environments, as well as information about the Git URL and branch.

variables: We can use the variables mapping to define custom variables and make our configuration file more reusable. For example, we declare a variable for the ID of an existing cluster and use it in different workflows. If you want to use a different cluster, all you have to do is change the variable value.

resources: The resources mapping is where we define our workflows. It includes zero or one of each of the following mappings: experiments, jobs, models, and pipelines. This is basically our deployment.yml file from the dbx project, with some minor differences:

- For the python_wheel_task, we must include the path to our wheel package; otherwise, Databricks can't find the library. You can find more info about building wheel packages using DAB in the Databricks guide "Develop a Python wheel by using Databricks Asset Bundles".
- We can use relative paths instead of full paths to run the notebook tasks. The path for the notebook to deploy is relative to the configuration file in which the task is declared.

targets: The targets mapping is where we define the configurations and resources of the different stages/environments of our project. For example, for a typical CI/CD pipeline, we would have three targets: development, staging, and production.
Each target can contain all the top-level mappings (except targets) as child mappings. Here is the schema of the targets mapping (databricks.yml):

    targets:
      <target-name>:
        artifacts:
          ...
        bundle:
          ...
        compute_id: string
        default: true | false
        mode: development
        resources:
          ...
        sync:
          ...
        variables:
          <variable-name>: <value>
        workspace:
          ...

The child mappings allow us to override the default configurations that we defined earlier in the top-level mappings. For example, if we want to have an isolated Databricks workspace for each stage of our CI/CD pipeline, we should set the workspace child mapping for each target.

    workspace:
      profile: my-default-profile

    targets:
      dev:
        default: true
      test:
        workspace:
          host: https://
      prod:
        workspace:
          host: https://

include: The include mapping allows us to break our configuration file into different modules. For example, we can save our resources and variables to the resources/dev_jobs.yml file and import it into our databricks.yml file.

    # yaml-language-server: $schema=bundle_config_schema.json
    bundle:
      name: DAB_tutorial  # our bundle name

    workspace:
      profile: asset-bundle-tutorial

    include:
      - ./resources/*.yml

    targets:
      # The 'dev' target, used for development purposes.
      # Whenever a developer deploys using 'dev', they get their own copy.
      dev:
        # We use 'mode: development' to make sure everything deployed to this target gets a prefix
        # like '[dev my_user_name]'. Setting this mode also disables any schedules and
        # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines.
        mode: development
        default: true

For a more detailed explanation of DAB configurations, check out Databricks Asset Bundle configurations.

Workflows

The workflows are exactly what I described in the previous blog. The only difference is the location of artifacts and files.
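One file worth a closer look is setup.py, because the python_wheel_task of etl_job resolves its package_name and entry_point against it. The following is only a sketch under assumptions, not the file from the tutorial repository: the function name entrypoint, the version number, the use of the console_scripts group, and the contents of the local extra are all hypothetical. Only my_package, etl_job, the package layout, and the name of the local extra come from this post.

```python
# setup.py -- hypothetical sketch, not the file from the tutorial repository.
import sys

# Keyword arguments for setuptools.setup(), kept in a plain dict so that they
# can be inspected without triggering a build.
SETUP_KWARGS = {
    "name": "my_package",  # must match package_name in the wheel task
    "version": "0.1.0",    # assumed version number
    "packages": ["my_package", "my_package.tasks"],
    "extras_require": {
        # Installed in the CI jobs with: pip install -e ".[local]"
        # The contents of this extra are an assumption.
        "local": ["pytest"],
    },
    "entry_points": {
        # The console_scripts group is an assumption.
        "console_scripts": [
            # "<entry point name> = <module>:<function>"
            # The function name "entrypoint" is hypothetical.
            "etl_job = my_package.tasks.sample_etl_job:entrypoint",
        ],
    },
}

if __name__ == "__main__" and len(sys.argv) > 1:
    # Only build when a command such as bdist_wheel is given on the command line.
    from setuptools import setup

    setup(**SETUP_KWARGS)
```

With a file along these lines, running python3 setup.py bdist_wheel writes the wheel to the dist/ folder, which is the folder the whl library path of the main task points to.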
The project skeleton: here is what the final project looks like.

    ASSET-BUNDLE-TUTORAL/
    ├─ my_package/
    │  ├─ tasks/
    │  │  ├─ __init__.py
    │  │  ├─ sample_etl_job.py
    │  ├─ __init__.py
    │  ├─ common.py
    ├─ test/
    │  ├─ conftest.py
    │  ├─ test_sample.py
    ├─ notebooks/
    │  ├─ explorative_analysis.py
    │  ├─ run_unit_test.py
    ├─ resources/
    │  ├─ dev_jobs.yml
    ├─ .gitignore
    ├─ .gitlab-ci.yml
    ├─ databricks.yml
    ├─ README.md
    ├─ setup.py

Validate, Deploy & Run

Now, open your terminal and run the following commands from the root directory:

validate: First, we should check if our configuration file has the right format and syntax. If the validation succeeds, you will get a JSON representation of the bundle configuration. In case of an error, fix it and run the command again until you receive the JSON output.

    databricks bundle validate

deploy: Deployment includes building the Python wheel package and deploying it to our Databricks workspace, deploying the notebooks and other files to our Databricks workspace, and creating the jobs in our Databricks workflows.

    databricks bundle deploy

If no command options are specified, the Databricks CLI uses the default target as declared within the bundle configuration files. Here, we only have one target, so it doesn't matter, but to demonstrate this, we can also deploy a specific target by using the -t dev flag.

run: Run the deployed jobs. Here, we can specify which job we want to run. For example, in the following command, we run the test_job job in the dev target.

    databricks bundle run -t dev test_job

In the output, you get a URL that points to the job run in your workspace. You can also find your jobs in the Workflows section of your Databricks workspace.

CI pipeline configuration

The general setup of our CI pipeline stays the same as in the previous project. It consists of two main stages: test and deploy. In the test stage, the unit-test-job runs the unit tests and deploys a separate workflow for testing.
The deploy stage, activated upon successful completion of the test stage, handles the deployment of your main ETL workflow. Here, we have to add additional steps before each job for installing the Databricks CLI and setting up the authentication profile. We do this in the before_script section of our CI pipeline. The before_script keyword is used to define an array of commands that should run before each job's script commands. More about it can be found in the GitLab CI/CD documentation. Optionally, you can use the after_script keyword to define an array of commands that should run after each job. Here, we can use databricks bundle destroy --auto-approve to clean up after each job is over.

In general, our pipeline goes through these steps:

1. Install the Databricks CLI and create the configuration profile.
2. Build the project.
3. Push the build artifacts to the Databricks workspace.
4. Install the wheel package on your cluster.
5. Create the jobs on Databricks Workflows.
6. Run the jobs.

Here is what our .gitlab-ci.yml looks like:

    image: python:3.9

    stages:  # List of stages for jobs, and their order of execution
      - test
      - deploy

    default:
      before_script:
        - echo "install databricks cli"
        - curl -V
        - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
        - echo "databricks CLI installation finished"
        - echo "create the configuration profile for token authentication"
        - echo "[asset-bundle-tutorial]" > ~/.databrickscfg
        - echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg
        - echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
        - echo "validate the bundle"
        - databricks bundle validate
      after_script:
        - echo "remove all workflows"
        #- databricks bundle destroy --auto-approve

    unit-test-job:  # This job runs in the test stage.
      stage: test
      script:
        - echo "Running unit tests."
        - pip3 install --upgrade wheel setuptools
        - pip install -e ".[local]"
        - databricks bundle deploy -t dev
        - databricks bundle run -t dev test_job

    deploy-job:  # This job runs in the deploy stage.
      stage: deploy  # It only runs when the job in the test stage completes successfully.
      script:
        - echo "Deploying application..."
        - echo "Install dependencies"
        - pip install -e ".[local]"
        - echo "Deploying Job"
        - databricks bundle deploy -t dev
        - databricks bundle run -t dev etl_job

Notes

Here are some notes that could help you set up your bundle project:

- In this blog, we created our bundle manually. In my experience, this helps to understand the underlying concepts and features better. But if you want a fast start with your project, you can use the default and non-default bundle templates that are provided by Databricks or other parties. Databricks has a post on how to initiate a project with the default Python template.
- When you deploy your code using databricks bundle deploy, the Databricks CLI runs the command python3 setup.py bdist_wheel to build your package using the setup.py file. If you already have Python 3 installed but your machine uses the python alias instead of python3, you will run into problems. However, this is easy to fix, and there are Stack Overflow threads with some solutions.

What's next

The next blog post will be about how to start a machine learning project on Databricks. It will be the first post in my upcoming series on an end-to-end machine learning pipeline, covering everything from development to production. Stay tuned!

Resources

- Repository for this tutorial. Make sure you update the cluster_id in resources/dev_jobs.yml.
- Migrate from dbx to bundles | Databricks on AWS
- Databricks Asset Bundles development work tasks | Databricks on AWS
- Databricks Asset Bundle deployment modes | Databricks on AWS
- Develop a Python wheel by using Databricks Asset Bundles | Databricks on AWS
- Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks (youtube.com)
- Repo and slides: https://github.com/databricks/databricks-asset-bundles-dais2023
The git push will trigger a GitLab Runner to build, deploy, and launch resources on Databricks using Databricks Asset Bundles. Setting up your development environments Databricks CLI First of all, we need to install Databricks CLI version 0.205 or above on your local machine. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI. Authentication Databricks supports various authentication methods between the Databricks CLI on our development machine and your Databricks workspace. For this tutorial, we use Databricks personal access token authentication. It consists of two steps: Create a personal access token on our Databricks workspace. Create a Databricks configuration profile on our local machine. To generate a Databricks token in your Databricks workspace, go to User Settings → Developer → Access tokens → Manage → Generate new token. To create a configuration profile, create the file ~/.databrickscfg in your root folder with the following content: Create a remote repository and clone it to our local environment and Databricks workspace. We use GitLab here. Create a remote repository and clone it to our local environment and Databricks workspace. We use GitLab here. GitLab Develop the program logic and test it inside the Databricks GUI or on our local IDE. This includes Python scripts to build a Python Wheel package, scripts to test data quality using pytest, and a notebook to run the pytest. Develop the program logic and test it inside the Databricks GUI or on our local IDE. This includes Python scripts to build a Python Wheel package, scripts to test data quality using pytest, and a notebook to run the pytest. Push the code to GitLab. The git push will trigger a GitLab Runner to build, deploy, and launch resources on Databricks using Databricks Asset Bundles. 
Setting up your development environments Databricks CLI First of all, we need to install Databricks CLI version 0.205 or above on your local machine. To check your installed version of the Databricks CLI, run the command databricks -v. To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI. Authentication Databricks supports various authentication methods between the Databricks CLI on our development machine and your Databricks workspace. For this tutorial, we use Databricks personal access token authentication. It consists of two steps: Create a personal access token on our Databricks workspace. Create a Databricks configuration profile on our local machine. Push the code to GitLab. The git push will trigger a GitLab Runner to build, deploy, and launch resources on Databricks using Databricks Asset Bundles. git push Setting up your development environments Databricks CLI First of all, we need to install Databricks CLI version 0.205 or above on your local machine. To check your installed version of the Databricks CLI, run the command databricks -v . To install Databricks CLI version 0.205 or above, see Install or update the Databricks CLI . databricks -v Install or update the Databricks CLI Authentication Databricks supports various authentication methods between the Databricks CLI on our development machine and your Databricks workspace. For this tutorial, we use Databricks personal access token authentication. It consists of two steps: various authentication methods Create a personal access token on our Databricks workspace. Create a Databricks configuration profile on our local machine. Create a personal access token on our Databricks workspace. Create a Databricks configuration profile on our local machine. To generate a Databricks token in your Databricks workspace, go to User Settings → Developer → Access tokens → Manage → Generate new token. 
To generate a Databricks token in your Databricks workspace, go to User Settings → Developer → Access tokens → Manage → Generate new token. To create a configuration profile, create the file ~/.databrickscfg in your root folder with the following content: To create a configuration profile, create the file ~/.databrickscfg in your root folder with the following content: ~/.databrickscfg [asset-bundle-tutorial] host = https://xxxxxxxxxxx.cloud.databricks.com token = xxxxxxx [asset-bundle-tutorial] host = https://xxxxxxxxxxx.cloud.databricks.com token = xxxxxxx Here, the asset-bundle-tutorial is our profile name, the host is the address of our workspace, and the token is the personal access token that we just created. asset-bundle-tutorial You can create this file using the Databricks CLI by running databricks configure --profile asset-bundle-tutorial in your terminal. The command will prompt you for the Databricks Host and Personal Access Token . If you don’t specify the --profile flag, the profile name will be set to DEFAULT . databricks configure --profile asset-bundle-tutorial Databricks Host Personal Access Token --profile DEFAULT Git integration (Databricks) As the first step, we configure Git credentials & connect a remote repo to Databricks . Next, we create a remote repository and clone it to our Databricks repo , as well as on our local machine. Finally we need set up authentication between the Databricks CLI on the Gitlab runner and our Databricks workspace. To do that, we should add two environment variables, DATABRICKS_HOST and DATABRICKS_TOKEN to our Gitlab CI/CD pipeline configurations. For that open your repo in Gitlab, go to Settings→ CI/CD → Variables → Add variables configure Git credentials & connect a remote repo to Databricks clone it to our Databricks repo DATABRICKS_HOST DATABRICKS_TOKEN Settings→ CI/CD → Variables → Add variables Both dbx and DAB are built around the Databricks REST APIs , so at their core, they are very similar. 
I will go through the steps to create a bundle manually from our existing dbx project. Databricks REST APIs The first thing that we need to set up for our DAB project is the deployment configuration. In dbx, we use two files to define and set up our environments and workflows (jobs and pipelines). To set up the environment, we used .dbx/project.json , and to define the workflows, we used deployment.yml . dbx, we use two files .dbx/project.json deployment.yml In DAB, everything goes into databricks.yml , which is located in the root folder of your project. Here's how it looks: databricks.yml bundle: name: DAB_tutorial #our bundle name # These are for any custom variables for use throughout the bundle. variables: my_cluster_id: description: The ID of an existing cluster. default: xxxx-xxxxx-xxxxxxxx #The remote workspace URL and workspace authentication credentials are read from the caller’s local configuration profile named workspace: profile: asset-bundle-tutorial # These are the default job and pipeline settings if not otherwise overridden in # the following "targets" top-level mapping. resources: jobs: etl_job: tasks: - task_key: "main" existing_cluster_id: ${var.my_cluster_id} python_wheel_task: package_name: "my_package" entry_point: "etl_job" # take a look at the setup.py entry_points section for details on how to define an entrypoint libraries: - whl: ../dist/*.whl - task_key: "eda" existing_cluster_id: ${var.my_cluster_id} notebook_task: notebook_path: ../notebooks/explorative_analysis.py source: WORKSPACE depends_on: - task_key: "main" test_job: tasks: - task_key: "main_notebook" existing_cluster_id: ${var.my_cluster_id} notebook_task: notebook_path: ../notebooks/run_unit_test.py source: WORKSPACE libraries: - pypi: package: pytest # These are the targets to use for deployments and workflow runs. One and only one of these # targets can be set to "default: true". targets: # The 'dev' target, used for development purposes. 
# Whenever a developer deploys using 'dev', they get their own copy. dev: # We use 'mode: development' to make sure everything deployed to this target gets a prefix # like '[dev my_user_name]'. Setting this mode also disables any schedules and # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines. mode: development default: true workspace: profile: asset-bundle-tutorial root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/my-envs/${bundle.target} host: bundle: name: DAB_tutorial #our bundle name # These are for any custom variables for use throughout the bundle. variables: my_cluster_id: description: The ID of an existing cluster. default: xxxx-xxxxx-xxxxxxxx #The remote workspace URL and workspace authentication credentials are read from the caller’s local configuration profile named workspace: profile: asset-bundle-tutorial # These are the default job and pipeline settings if not otherwise overridden in # the following "targets" top-level mapping. resources: jobs: etl_job: tasks: - task_key: "main" existing_cluster_id: ${var.my_cluster_id} python_wheel_task: package_name: "my_package" entry_point: "etl_job" # take a look at the setup.py entry_points section for details on how to define an entrypoint libraries: - whl: ../dist/*.whl - task_key: "eda" existing_cluster_id: ${var.my_cluster_id} notebook_task: notebook_path: ../notebooks/explorative_analysis.py source: WORKSPACE depends_on: - task_key: "main" test_job: tasks: - task_key: "main_notebook" existing_cluster_id: ${var.my_cluster_id} notebook_task: notebook_path: ../notebooks/run_unit_test.py source: WORKSPACE libraries: - pypi: package: pytest # These are the targets to use for deployments and workflow runs. One and only one of these # targets can be set to "default: true". targets: # The 'dev' target, used for development purposes. # Whenever a developer deploys using 'dev', they get their own copy. 
dev: # We use 'mode: development' to make sure everything deployed to this target gets a prefix # like '[dev my_user_name]'. Setting this mode also disables any schedules and # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines. mode: development default: true workspace: profile: asset-bundle-tutorial root_path: /Users/${workspace.current_user.userName}/.bundle/${bundle.name}/my-envs/${bundle.target} host: The databricks.yml bundle configuration file consists of sections called mappings. These mappings allow us to modularize the configuration file into separate logical blocks. There are 8 top-level mappings: databricks.yml bundle variables workspace artifacts include resources sync targets bundle bundle variables variables workspace workspace artifacts artifacts include include resources resources sync sync targets targets Here, we use five of these mappings to organize our project. bundle : bundle In the bundle mapping, we define the name of the bundle. Here, we can also define a default cluster ID that should be used for our development environments, as well as information about the Git URL and branch. bundle variables : variables We can use the variables mapping to define custom variables and make our configuration file more reusable. For example, we declare a variable for the ID of an existing cluster and use it in different workflows. Now, in case you want to use a different cluster, all you have to do is to change the variable value. variables resources : resources The resources mapping is where we define our workflows. It includes zero or one of each of the following mappings: experiments , jobs , models , and pipelines . This is basically our deployment.yml file in the dbx project. Though there are some minor differences: resources experiments jobs models pipelines deployment.yml For the python_wheel_task, we must include the path to our wheel package; otherwise, Databricks can’t find the library. 
You can find more info about building wheel packages using DAB here. We can use relative paths instead of full paths to run the notebook tasks. The path for the notebook to deploy is relative to the databricks.yml file in which this task is declared. For the python_wheel_task , we must include the path to our wheel package; otherwise, Databricks can’t find the library. You can find more info about building wheel packages using DAB here . python_wheel_task here We can use relative paths instead of full paths to run the notebook tasks. The path for the notebook to deploy is relative to the databricks.yml file in which this task is declared. databricks.yml targets : targets The targets mapping is where we define the configurations and resources of different stages/environments of our projects. For example, for a typical CI/CD pipeline, we would have three targets: development, staging, and production. Each target can consist of all the top-level mappings (except targets ) as child mappings. Here is the schema of the target mapping ( databricks.yml ). targets targets databricks.yml targets: : artifacts: ... bundle: ... compute_id: string default: true | false mode: development resources: ... sync: ... variables: : workspace: ... targets: : artifacts: ... bundle: ... compute_id: string default: true | false mode: development resources: ... sync: ... variables: : workspace: ... The child mapping allows us to override the default configurations that we defined earlier in the top-level mappings. For example, if we want to have an isolated Databricks workspace for each stage of our CI/CD pipeline, we should set the workspace child mapping for each target. 
workspace: profile: my-default-profile targets: dev: default: true test: workspace: host: https:// prod: workspace: host: https:// workspace: profile: my-default-profile targets: dev: default: true test: workspace: host: https:// prod: workspace: host: https:// include: include: The include mapping allows us to break our configuration file into different modules. For example, we can save our resources and variables to the resources/dev_job.yml file and import it into our databricks.yml file. include resources/dev_job.yml databricks.yml # yaml-language-server: $schema=bundle_config_schema.json bundle: name: DAB_tutorial #our bundle name workspace: profile: asset-bundle-tutorial include: - ./resources/*.yml targets: # The 'dev' target, used for development purposes. # Whenever a developer deploys using 'dev', they get their own copy. dev: # We use 'mode: development' to make sure everything deployed to this target gets a prefix # like '[dev my_user_name]'. Setting this mode also disables any schedules and # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines. mode: development default: true # yaml-language-server: $schema=bundle_config_schema.json bundle: name: DAB_tutorial #our bundle name workspace: profile: asset-bundle-tutorial include: - ./resources/*.yml targets: # The 'dev' target, used for development purposes. # Whenever a developer deploys using 'dev', they get their own copy. dev: # We use 'mode: development' to make sure everything deployed to this target gets a prefix # like '[dev my_user_name]'. Setting this mode also disables any schedules and # automatic triggers for jobs and enables the 'development' mode for Delta Live Tables pipelines. mode: development default: true For more detailed explanation of DAB configurations check out Databricks Asset Bundle configurations Databricks Asset Bundle configurations Workflows The workflows are exactly what I described in previous blog. 
The only difference is the location of artifacts and files.

The project skeleton

Here is what the final project looks like:

    ASSET-BUNDLE-TUTORAL/
    ├─ my_package/
    │  ├─ tasks/
    │  │  ├─ __init__.py
    │  │  ├─ sample_etl_job.py
    │  ├─ __init__.py
    │  ├─ common.py
    ├─ test/
    │  ├─ conftest.py
    │  ├─ test_sample.py
    ├─ notebooks/
    │  ├─ explorative_analysis.py
    │  ├─ run_unit_test.py
    ├─ resources/
    │  ├─ dev_jobs.yml
    ├─ .gitignore
    ├─ .gitlab-ci.yml
    ├─ databricks.yml
    ├─ README.md
    ├─ setup.py

Validate, Deploy & Run

Now, open your terminal and run the following commands from the root directory:

validate: First, we should check that our configuration file has the right format and syntax. If the validation succeeds, you will get a JSON representation of the bundle configuration. In case of an error, fix it and run the command again until you receive the JSON output.

    databricks bundle validate
deploy: Deployment includes building the Python wheel package and deploying it to our Databricks workspace, deploying the notebooks and other files to our Databricks workspace, and creating the jobs in our Databricks workflows.

    databricks bundle deploy

If no command options are specified, the Databricks CLI uses the default target as declared within the bundle configuration files. Here, we only have one target, so it doesn’t matter, but to demonstrate this, we can also deploy a specific target by using the -t dev flag.

run: Run the deployed jobs. Here, we can specify which job we want to run.
For example, in the following command, we run the test_job job in the dev target.

    databricks bundle run -t dev test_job

In the output, you get a URL that points to the job run in your workspace. You can also find your jobs in the Workflows section of your Databricks workspace.

CI pipeline configuration

The general setup of our CI pipeline stays the same as in the previous project. It consists of two main stages: test and deploy. In the test stage, the unit-test-job runs the unit tests and deploys a separate workflow for testing. The deploy stage, activated upon successful completion of the test stage, handles the deployment of your main ETL workflow.

Here, we have to add steps before each stage to install the Databricks CLI and set up the authentication profile. We do this in the before_script section of our CI pipeline. The before_script keyword defines an array of commands that run before each job’s script commands. More about it can be found in the GitLab CI/CD documentation.

Optionally, you can use the after_script keyword to define an array of commands that run AFTER each job. Here, we can use databricks bundle destroy --auto-approve to clean up after each job is over. In general, our pipeline goes through these steps:

1. Install the Databricks CLI and create the configuration profile.
2. Build the project.
3. Push the build artifacts to the Databricks workspace.
4. Install the wheel package on your cluster.
5. Create the jobs on Databricks Workflows.
6. Run the jobs.
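The bundle-related steps above can be sketched as a small Python helper, for instance to rehearse a pipeline job locally before pushing. This is a minimal sketch and not part of the original project: the names bundle_commands and run_pipeline are hypothetical, and it assumes the Databricks CLI is installed and the asset-bundle-tutorial profile already exists. It only wraps the databricks bundle validate, deploy, and run commands used in this post; building the wheel, uploading the artifacts, and creating the jobs all happen inside the deploy command.

```python
import subprocess
from typing import List


def bundle_commands(target: str, job: str) -> List[List[str]]:
    """Return the CLI calls for one pipeline job, in execution order.

    Hypothetical helper: it only assembles the commands shown in this post.
    """
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, job],
    ]


def run_pipeline(target: str, job: str, dry_run: bool = True) -> List[str]:
    """Print each command and, unless dry_run is set, execute it.

    check=True makes a failing step raise CalledProcessError, so the
    remaining steps are skipped, much like a failing command in a CI script.
    """
    executed = []
    for cmd in bundle_commands(target, job):
        line = " ".join(cmd)
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
        executed.append(line)
    return executed


if __name__ == "__main__":
    # Dry run by default: nothing is sent to the workspace.
    run_pipeline("dev", "test_job")
```

Calling run_pipeline("dev", "test_job", dry_run=False) would execute the three commands against the dev target. Keeping the command list in its own function makes it easy to inspect what would run without touching the workspace.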
Here is what our .gitlab-ci.yml looks like:

    image: python:3.9

    stages: # List of stages for jobs, and their order of execution
      - test
      - deploy

    default:
      before_script:
        - echo "install databricks cli"
        - curl -V
        - curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
        - echo "databricks CLI installation finished"
        - echo "create the configuration profile for token authentication"
        - echo "[asset-bundle-tutorial]" > ~/.databrickscfg
        - echo "token = $DATABRICKS_TOKEN" >> ~/.databrickscfg
        - echo "host = $DATABRICKS_HOST" >> ~/.databrickscfg
        - echo "validate the bundle"
        - databricks bundle validate
      after_script:
        - echo "remove all workflows"
        #- databricks bundle destroy --auto-approve

    unit-test-job: # This job runs in the test stage.
      stage: test
      script:
        - echo "Running unit tests."
        - pip3 install --upgrade wheel setuptools
        - pip install -e ".[local]"
        - databricks bundle deploy -t dev
        - databricks bundle run -t dev test_job

    deploy-job: # This job runs in the deploy stage.
      stage: deploy # It only runs when the job in the test stage completes successfully.
      script:
        - echo "Deploying application..."
        - echo "Install dependencies"
        - pip install -e ".[local]"
        - echo "Deploying Job"
        - databricks bundle deploy -t dev
        - databricks bundle run -t dev etl_job

Notes

Here are some notes that could help you set up your bundle project:

- In this blog, we created our bundle manually. In my experience, this helps to understand the underlying concepts and features better. But if you want a fast start with your project, you can use the default and non-default bundle templates that are provided by Databricks or other parties. Check out this Databricks post to learn how to initiate a project with the default Python template.
- When you deploy your code using databricks bundle deploy, the Databricks CLI runs the command python3 setup.py bdist_wheel to build your package using the setup.py file. If you already have Python 3 installed but your machine uses the python alias instead of python3, you will run into problems. However, this is easy to fix, and Stack Overflow has threads with some solutions.

What’s next

In the next blog post, I will show how to start a machine learning project on Databricks. It will be the first post in my upcoming series on an end-to-end machine learning pipeline, covering everything from development to production. Stay tuned!

Resources

- Repository for this tutorial.
  Make sure you update the cluster_id in resources/dev_jobs.yml.
- Migrate from dbx to bundles | Databricks on AWS
- Databricks Asset Bundles development work tasks | Databricks on AWS
- Databricks Asset Bundle deployment modes | Databricks on AWS
- Develop a Python wheel by using Databricks Asset Bundles | Databricks on AWS
- Databricks Asset Bundles: A Standard, Unified Approach to Deploying Data Products on Databricks (youtube.com)
  Repo and slides: https://github.com/databricks/databricks-asset-bundles-dais2023