In this post, we will look at creating an Azure data factory with a pipeline that loads Office 365 event data incrementally based on change data capture (CDC) information in the source of Change Data Feed(CDF) of a Delta lake table to an AWS S3 bucket. What we’ll cover: Create an ADF Pipeline that loads Calendar events from Offfice365 to a Blob container. Run a Databricks Notebook with the activity in the ADF pipeline, transform extracted Calendar event and merge to a Delta lake table. Load Change Data Feed on the Delta lake table to an AWS S3 bucket. The following diagram shows the architecture of the above steps: Prerequisites To build the pipeline, there are a few prerequisites that need to be met: Setup an Azure account. Register an AD application with Microsoft Graph permission. . Configure Microsoft Graph data connect consent request approver group t. Create an Azure Storage Accoun Create an AWS S3 bucket. Before going any further, we will need to create an Azure Storage account and grant the Azure AD application we registered the proper access to the container of the storage account. Check out the step-by-step Azure for Authorizing access to blobs using Azure Active Directory. guide 1. Create an Azure Data Factory Pipeline In this step, we will create an ADF Pipeline that loads Office365 Calendar Views data to an Azure Blob container. I will assume you already have an Azure Subscription and Resource Group setup. To create an Azure data factory, run the following command: az datafactory create --resource- --factory-name ADFDemo group RESOURCE_GROUP 1.1 Create linked services and datasets To connect to Office 365, we need to use the following command to create a linked service for Office 365 connection: az datafactory linked - service create - - resource - group RESOURCE_GROUP - - factory - name ADFDemo - - linked - service - name Offic365LinkedService - - properties @offic365LinkedService . json Replace APP_CLIENT_ID , TENANT_ID and APP_CLIENT_SECRET with corresponding values from your Active Directory tenant and App. Next, create a Calendar dataset named calendarDataset to reference the Calendar view data from Office 365. az datafactory dataset create - - resource - group RESOURCE_GROUP - - dataset - name calendarDataset - - factory - name ADFDemo - - properties @calendarDataset . json Now, let’s create another Linked Services for Blob Storage Account to store calendar data extracted from Office 365. az datafactory linked - service create - - resource - group RESOURCE_GROUP - - factory - name ADFDemo - - linked - service - name AzureStorageLinkedService - - properties @azureStorageLinkedService . json ECRET Replace YOUR_CONTAINER , APP_CLIENT_S , TENANT_ID and APP_CLIENT_SECRET with corresponding values from your Active Directory tenant, App, and storage account. Next, create a binary dataset named . outputDataset az datafactory dataset create - - resource - group RESOURCE_GROUP - - dataset - name outputDataset - - factory - name ADFDemo - - properties @outputDataset . json Note that @dataset.outputPath in the folderPath section is a global variable that we will set in the next step. 1.2 Create an ADF pipeline In this section, we will use the Azure CLI command to create a simple ADF pipeline with three activities: to set a global variable . SetVariableActivity outputPath named to load data from Office 365 to Azure blob storage. CopyActivity ExtractCalendarView to throw an error when the pipeline fails. FailActivity Enter the following commands to create ADF Pipeline with activities: az datafactory pipeline create --resource-group RESOURCE_GROUP --factory-name ADFDemo --name DemoPipeline  --pipeline @demoPipeline.json Once the pipeline is created, Navigate to the Azure portal and launch the Azure Data Factory studio. We can view the pipeline activities we have created from the pipeline designer. 2. Create Notebook in Azure Databricks In this section, we will create a notebook in Azure Databricks. You can refer to this to create a notebook in the Azure Databricks workspace. tutorial Before starting the notebook, we have to set up a connection from Azure Databricks to Azure Storage Account using a SAS key. Check out this for creating a Databricks secret scope. tutorial We are going to perform the following steps in the notebook: Mount a Blob storage container. Transform calendar view data. Create an SCD Type 1 delta table with Change Data Feed enabled. Merge input data frame to the delta table. Configure AWS S3 connection, and load change feed data to s3 bucket. The notebook would look similar to this: Now we have our notebook created, Let’s move back to step 1.1 to create linked services for Azure Databricks connection. az datafactory linked - service create - - resource - group RESOURCE_GROUP - - factory - name ADFDemo - - linked - service - name databricksLinkedService - - properties @databricksLinkedService . json Next, delete the pipeline that we created in step 1.2, recreate it to execute the Databricks notebook against the Databricks jobs cluster. Delete the existing pipeline we have created: az datafactory pipeline delete - - factory - name - - name ADFDemo - - resource - group RESOURCE_GROUP then recreate the pipeline using the following command: az datafactory pipeline create - - resource - group RESOURCE_GROUP - - factory - name ADFDemo - - name DemoPipeline - - pipeline @demoPipeline . json Now we should have our pipeline that looks like the following diagram. Let’s publish entities (linked services and pipeline) to the Azure Data Factory service using ADF studio. and trigger the pipeline. We can log on to the Azure Databricks workspace, go to Clusters and see the Job status as . pending execution, running, or terminated After the pipeline run is completed, log on to the AWS S3 console, we should see the CSV file in the following format is loaded to the destination s3 bucket. Conclusion In this article, I introduced the approach to load data incrementally from Office365 to an AWS S3 bucket using ADF and Azure Databricks. I hope you found this article useful. You can find the ARM template and notebook in my . GitHub repository

Microsoft

How I Built a Data Analysis Assistant with BigQuery and Langchain

How To Integrate Amazon Cognito with Ethereum Blockchain: A Step-by-Step Guide

Building an ETL Pipeline to Load Data Incrementally from Office365 to S3 using ADF and Databricks

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

An easy way to manage Serverless project resources by using AWS Resource Groups

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

An easy way to manage Serverless project resources by using AWS Resource Groups

104 Stories To Learn About Go

105 Stories To Learn About Functional Programming

100+ Free Pluralsight Courses to learn Python, Java, and Spring Boot

10 Websites to Learn JavaScript for Beginners

104 Stories To Learn About Programming Top Story

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps