In this post, we will look at creating an Azure data factory with a pipeline that loads Office 365 event data incrementally based on change data capture (CDC) information in the source of Change Data Feed(CDF) of a Delta lake table to an AWS S3 bucket.
The following diagram shows the architecture of the above steps:
To build the pipeline, there are a few prerequisites that need to be met:
Before going any further, we will need to create an Azure Storage account and grant the Azure AD application we registered the proper access to the container of the storage account. Check out the step-by-step Azure guide for Authorizing access to blobs using Azure Active Directory.
In this step, we will create an ADF Pipeline that loads Office365 Calendar Views data to an Azure Blob container.
I will assume you already have an Azure Subscription and Resource Group setup.
To create an Azure data factory, run the following command:
az datafactory create --resource-group RESOURCE_GROUP --factory-name ADFDemo
To connect to Office 365, we need to use the following command to create a linked service for Office 365 connection:
az datafactory linked-service create --resource-group RESOURCE_GROUP --factory-name ADFDemo --linked-service-name Offic365LinkedService --properties @offic365LinkedService.json
Replace
APP_CLIENT_ID
, TENANT_ID
and APP_CLIENT_SECRET
with corresponding values from your Active Directory tenant and App.Next, create a Calendar dataset named calendarDataset to reference the Calendar view data from Office 365.
az datafactory dataset create --resource-group RESOURCE_GROUP --dataset-name calendarDataset --factory-name ADFDemo --properties @calendarDataset.json
Now, let’s create another Linked Services for Blob Storage Account to store calendar data extracted from Office 365.
az datafactory linked-service create --resource-group RESOURCE_GROUP --factory-name ADFDemo --linked-service-name AzureStorageLinkedService --properties @azureStorageLinkedService.json
Replace
YOUR_CONTAINER
, APP_CLIENT_S
ECRET, TENANT_ID
and APP_CLIENT_SECRET
with corresponding values from your Active Directory tenant, App, and storage account.Next, create a binary dataset named
outputDataset
.az datafactory dataset create --resource-group RESOURCE_GROUP --dataset-name outputDataset --factory-name ADFDemo --properties @outputDataset.json
Note that @dataset.outputPath in the folderPath section is a global variable that we will set in the next step.
1.2 Create an ADF pipeline
In this section, we will use the Azure CLI command to create a simple ADF pipeline with three activities:
SetVariableActivity
to set a global variable outputPath
.CopyActivity
named ExtractCalendarView
to load data from Office 365 to Azure blob storage.FailActivity
to throw an error when the pipeline fails.Enter the following commands to create ADF Pipeline with activities:
az datafactory pipeline create --resource-group RESOURCE_GROUP --factory-name ADFDemo --name DemoPipeline --pipeline @demoPipeline.json
Once the pipeline is created, Navigate to the Azure portal and launch the Azure Data Factory studio. We can view the pipeline activities we have created from the pipeline designer.
In this section, we will create a notebook in Azure Databricks. You can refer to this tutorial to create a notebook in the Azure Databricks workspace.
Before starting the notebook, we have to set up a connection from Azure Databricks to Azure Storage Account using a SAS key. Check out this tutorial for creating a Databricks secret scope.
We are going to perform the following steps in the notebook:
The notebook would look similar to this:
Now we have our notebook created, Let’s move back to step 1.1 to create linked services for Azure Databricks connection.
az datafactory linked-service create --resource-group RESOURCE_GROUP --factory-name ADFDemo --linked-service-name databricksLinkedService --properties @databricksLinkedService.json
Next, delete the pipeline that we created in step 1.2, recreate it to execute the Databricks notebook against the Databricks jobs cluster.
Delete the existing pipeline we have created:
az datafactory pipeline delete --factory-name --name ADFDemo --resource-group RESOURCE_GROUP
then recreate the pipeline using the following command:
az datafactory pipeline create --resource-group RESOURCE_GROUP --factory-name ADFDemo --name DemoPipeline --pipeline @demoPipeline.json
Now we should have our pipeline that looks like the following diagram.
Let’s publish entities (linked services and pipeline) to the Azure Data Factory service using ADF studio.
and trigger the pipeline.
We can log on to the Azure Databricks workspace, go to Clusters and see the Job status as pending execution, running, or terminated.
After the pipeline run is completed, log on to the AWS S3 console, we should see the CSV file in the following format is loaded to the destination s3 bucket.
In this article, I introduced the approach to load data incrementally from Office365 to an AWS S3 bucket using ADF and Azure Databricks. I hope you found this article useful.
You can find the ARM template and notebook in my GitHub repository.