Building an ETL Pipeline to Load Data Incrementally from Office365 to S3 using ADF and Databricks

Written by yi | Published 2021/11/19
Tech Story Tags: databricks | delta-lake | data-factory | data-pipeline | pyspark | coding | hackernoon-top-story | tutorial

TLDRIn this post, we will look at creating an Azure data factory with a pipeline that loads Office 365 event data incrementally based on change data capture (CDC) information in the source of Change Data Feed(CDF) of a Delta lake table to an AWS S3 bucket. What we’ll cover: Create an ADF Pipeline that loads Calendar events from Offfice365 to a Blob container. Run a Databricks Notebook with the activity in the ADF pipeline, transform extracted Calendar event and merge to a Delta Lake table.via the TL;DR App

In this post, we will look at creating an Azure data factory with a pipeline that loads Office 365 event data incrementally based on change data capture (CDC) information in the source of Change Data Feed(CDF) of a Delta lake table to an AWS S3 bucket.

What we’ll cover:

  • Create an ADF Pipeline that loads Calendar events from Offfice365 to a Blob container.
  • Run a Databricks Notebook with the activity in the ADF pipeline, transform extracted Calendar event and merge to a Delta lake table.
  • Load Change Data Feed on the Delta lake table to an AWS S3 bucket.
The following diagram shows the architecture of the above steps:

Prerequisites

To build the pipeline, there are a few prerequisites that need to be met:
Before going any further, we will need to create an Azure Storage account and grant the Azure AD application we registered the proper access to the container of the storage account. Check out the step-by-step Azure guide for Authorizing access to blobs using Azure Active Directory.

1. Create an Azure Data Factory Pipeline

In this step, we will create an ADF Pipeline that loads Office365 Calendar Views data to an Azure Blob container.
I will assume you already have an Azure Subscription and Resource Group setup.
To create an Azure data factory, run the following command:
az datafactory create --resource-group RESOURCE_GROUP --factory-name ADFDemo

1.1 Create linked services and datasets

To connect to Office 365, we need to use the following command to create a linked service for Office 365 connection:
az datafactory linked-service create --resource-group RESOURCE_GROUP  --factory-name ADFDemo --linked-service-name Offic365LinkedService --properties @offic365LinkedService.json
Replace 
APP_CLIENT_ID
 , 
TENANT_ID
 and 
APP_CLIENT_SECRET
 with corresponding values from your Active Directory tenant and App.
Next, create a Calendar dataset named calendarDataset to reference the Calendar view data from Office 365.
az datafactory dataset create --resource-group RESOURCE_GROUP --dataset-name calendarDataset --factory-name ADFDemo --properties @calendarDataset.json
Now, let’s create another Linked Services for Blob Storage Account to store calendar data extracted from Office 365.
az datafactory linked-service create --resource-group RESOURCE_GROUP  --factory-name ADFDemo --linked-service-name AzureStorageLinkedService --properties @azureStorageLinkedService.json
Replace 
YOUR_CONTAINER
 , 
APP_CLIENT_S
ECRET
TENANT_ID
 and 
APP_CLIENT_SECRET
 with corresponding values from your Active Directory tenant, App, and storage account.
Next, create a binary dataset named 
outputDataset
.
az datafactory dataset create --resource-group RESOURCE_GROUP --dataset-name outputDataset --factory-name ADFDemo --properties @outputDataset.json
Note that @dataset.outputPath in the folderPath section is a global variable that we will set in the next step.
1.2 Create an ADF pipeline
In this section, we will use the Azure CLI command to create a simple ADF pipeline with three activities:
  • SetVariableActivity
     to set a global variable 
    outputPath
     .
  • CopyActivity
     named 
    ExtractCalendarView
     to load data from Office 365 to Azure blob storage.
  • FailActivity
     to throw an error when the pipeline fails.
Enter the following commands to create ADF Pipeline with activities:
az datafactory pipeline create --resource-group RESOURCE_GROUP --factory-name ADFDemo --name DemoPipeline  --pipeline @demoPipeline.json
Once the pipeline is created, Navigate to the Azure portal and launch the Azure Data Factory studio. We can view the pipeline activities we have created from the pipeline designer.

2. Create Notebook in Azure Databricks

In this section, we will create a notebook in Azure Databricks. You can refer to this tutorial to create a notebook in the Azure Databricks workspace.
Before starting the notebook, we have to set up a connection from Azure Databricks to Azure Storage Account using a SAS key. Check out this tutorial for creating a Databricks secret scope.
We are going to perform the following steps in the notebook:
  • Mount a Blob storage container.
  • Transform calendar view data.
  • Create an SCD Type 1 delta table with Change Data Feed enabled.
  • Merge input data frame to the delta table.
  • Configure AWS S3 connection, and load change feed data to s3 bucket.
The notebook would look similar to this:
Now we have our notebook created, Let’s move back to step 1.1 to create linked services for Azure Databricks connection.
az datafactory linked-service create --resource-group RESOURCE_GROUP  --factory-name ADFDemo --linked-service-name databricksLinkedService --properties @databricksLinkedService.json
Next, delete the pipeline that we created in step 1.2, recreate it to execute the Databricks notebook against the Databricks jobs cluster.
Delete the existing pipeline we have created:
az datafactory pipeline delete --factory-name                                --name ADFDemo --resource-group RESOURCE_GROUP
then recreate the pipeline using the following command:
az datafactory pipeline create --resource-group RESOURCE_GROUP --factory-name ADFDemo --name DemoPipeline  --pipeline @demoPipeline.json
Now we should have our pipeline that looks like the following diagram.
Let’s publish entities (linked services and pipeline) to the Azure Data Factory service using ADF studio.
and trigger the pipeline.
We can log on to the Azure Databricks workspace, go to Clusters and see the Job status as pending execution, running, or terminated.
After the pipeline run is completed, log on to the AWS S3 console, we should see the CSV file in the following format is loaded to the destination s3 bucket.

Conclusion

In this article, I introduced the approach to load data incrementally from Office365 to an AWS S3 bucket using ADF and Azure Databricks. I hope you found this article useful.
You can find the ARM template and notebook in my GitHub repository.

Published by HackerNoon on 2021/11/19