Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 2)

In the previous post, I wrote about creating an ADF account to do the transformations. Now, let's learn how to create pipelines with ADF. In this post, I will explain the main building blocks of ADF and, at the end, share links to a demo you can follow to create a pipeline yourself.

ADF is built from the following major components:

1. Linked Services
2. Data Sets
3. Activities
4. Pipelines
5. Data Gateways

[Image: The components of Azure Data Factory]

Linked Services – Linked services connect resources to ADF. For example, if we need to process a perfmon file, we first need a linked service for the storage account the file will be picked up from, and another linked service for the compute that will process the file, and so on.

Storage – As shown below, there are various data stores to choose from. We need to select both the source and the destination of the data. For example, if I need to pick up files kept in Azure Storage, I will choose Azure Storage as the data store; if the transformed data has to be stored in SQL Server or DocumentDB, I will pick those options from the same list:

[Image: Data store options for a new linked service]
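In ADF (v1), each linked service is defined as a JSON document. A minimal sketch of an Azure Storage linked service might look like the following; the name and the connection string values are illustrative placeholders, not taken from the post:

```json
{
  "name": "AzureStorageLinkedService",
  "properties": {
    "type": "AzureStorage",
    "typeProperties": {
      "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>"
    }
  }
}
```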

Compute – Once the source and destination data stores are selected, we need to select the compute. The available compute options are:

[Image: Compute options for a new linked service]

In this example, I want to process the file using ADLA (U-SQL queries), so I will choose Azure Data Lake Analytics. However, if I had to process the file using Hive or Pig queries, I would have chosen an HDInsight or on-demand HDInsight cluster.
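A compute linked service is also a JSON document. The following is only a sketch of an Azure Data Lake Analytics linked service; the account, authorization, subscription and resource group values are placeholders you would obtain from the portal:

```json
{
  "name": "AzureDataLakeAnalyticsLinkedService",
  "properties": {
    "type": "AzureDataLakeAnalytics",
    "typeProperties": {
      "accountName": "<ADLA account name>",
      "dataLakeAnalyticsUri": "azuredatalakeanalytics.net",
      "authorization": "<authorization code from the portal>",
      "sessionId": "<session id from the portal>",
      "subscriptionId": "<subscription id>",
      "resourceGroupName": "<resource group name>"
    }
  }
}
```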

Data Sets – The concept is similar to Reporting Services reports or SSIS packages: we first choose a data source, which could be anything from a SQL Server database to an MS Access database. Similarly, here, once the source and destination linked services are in place, we define the data set, which can be a file/folder or a table/collection.

In our current scenario, the files live in Azure Blob storage and we have already created a linked service for Blob storage. Therefore, the data set will also be an Azure Blob Storage data set:

[Image: New data set for Azure Blob Storage]

In this data set, you specify the actual file name or folder you want to pick up.
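A data set is again a JSON document that references its linked service. A rough sketch of a Blob data set pointing at a single input file could look like this; the folder path, file name and availability schedule are hypothetical placeholders:

```json
{
  "name": "InputBlobDataset",
  "properties": {
    "type": "AzureBlob",
    "linkedServiceName": "AzureStorageLinkedService",
    "typeProperties": {
      "folderPath": "<container>/input/",
      "fileName": "<file to process>",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\t"
      }
    },
    "external": true,
    "availability": {
      "frequency": "Day",
      "interval": 1
    }
  }
}
```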

Activities – Activities define the actions to be performed on the data. The relationship between all the components is as follows:

[Image: Relationship between linked services, data sets, activities and pipelines]

Pipeline – Once the source and destination of the file are decided, the file to work on is selected and the compute is chosen, it is time to create a pipeline with the set of activities that perform the transformation:
[Image: Pipeline with transformation activities]
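Putting it together, the pipeline itself is a JSON document whose activities reference the data sets and linked services defined earlier. A minimal sketch of a pipeline with one U-SQL activity is shown below; the script path, parameter values, output data set name and schedule are placeholders, not from the original post:

```json
{
  "name": "ProcessPerfmonPipeline",
  "properties": {
    "description": "Runs a U-SQL script on ADLA against the input file",
    "activities": [
      {
        "name": "RunUsqlScript",
        "type": "DataLakeAnalyticsU-SQL",
        "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",
        "inputs": [ { "name": "InputBlobDataset" } ],
        "outputs": [ { "name": "OutputBlobDataset" } ],
        "typeProperties": {
          "scriptPath": "scripts/ProcessPerfmon.usql",
          "scriptLinkedService": "AzureStorageLinkedService",
          "degreeOfParallelism": 3,
          "parameters": {
            "in": "/input/perfmon.csv",
            "out": "/output/result.tsv"
          }
        },
        "scheduler": { "frequency": "Day", "interval": 1 }
      }
    ],
    "start": "2016-01-01T00:00:00Z",
    "end": "2016-01-02T00:00:00Z"
  }
}
```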

Data Gateway – If you want to move data back and forth between on-premises systems and Azure, you use the Data Management Gateway.

[Image: Data Management Gateway connecting on-premises data to Azure]
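For an on-premises source, the linked service definition references the gateway installed on the local machine. A sketch of an on-premises SQL Server linked service, with the gateway name and connection string as placeholders, could look like:

```json
{
  "name": "OnPremSqlServerLinkedService",
  "properties": {
    "type": "OnPremisesSqlServer",
    "typeProperties": {
      "connectionString": "Data Source=<server>;Initial Catalog=<database>;Integrated Security=False;User ID=<user>;Password=<password>;",
      "gatewayName": "<name of the Data Management Gateway>"
    }
  }
}
```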

Links to demos you can try yourself:
https://azure.microsoft.com/en-us/documentation/articles/data-factory-usql-activity/
https://azure.microsoft.com/en-us/blog/azure-data-factory-updates-execute-adf-custom-net-activities-using-azure-batch/

References – https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/
