Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 2)

In the previous post, I wrote about creating an ADF account. Now, let's learn how to create pipelines with ADF. In this post I will explain the functionality of ADF, and at the end I will share links to a demo you can try to create a pipeline yourself.

There are five major components of ADF:

1. Linked Services
2. Data Sets
3. Activities
4. Pipelines
5. Data Gateways

[Image: Azure Data Factory components]

Linked Services – Linked services are used to link resources to ADF. For example, if we need to process a perfmon file, we first need a linked service for the storage account from which ADF will pick up the file, another linked service for the compute that will process the file, and so on.

Storage – As shown below, there are various data stores to choose from. We need to select both the source and the destination of the data. For example, if I need to pick up files kept in Azure Storage, I will choose Azure Storage as the data store. Likewise, if the transformed data has to be stored in SQL Server or DocumentDB, I will pick the respective option from the same list:

[Image: Data store options for a new linked service]
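Under the hood, a linked service is just a small JSON definition registered with the factory. A minimal sketch of an Azure Storage linked service, assuming the classic ADF JSON schema used in the demo linked at the end of this post (the account name and key are placeholders you would replace with your own):

    {
      "name": "StorageLinkedService",
      "properties": {
        "type": "AzureStorage",
        "typeProperties": {
          "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
      }
    }

ADF uses this connection string whenever a data set or activity refers to StorageLinkedService.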

Compute – Once the source and destination data stores are selected, we need to select the compute. The available compute options are shown below:

[Image: Compute options for a new linked service]

In this example, I want to process the file with ADLA U-SQL queries, so I will choose Azure Data Lake Analytics. However, if I had to process the file with Hive or Pig queries, I would have chosen an HDInsight cluster or an on-demand HDInsight cluster.
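The compute is registered the same way. A rough sketch of an Azure Data Lake Analytics linked service in the same classic ADF JSON schema (the account, subscription and resource group values are placeholders; the authorization code and session ID are generated for you when you click Authorize in the ADF editor):

    {
      "name": "AzureDataLakeAnalyticsLinkedService",
      "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
          "accountName": "<adla account name>",
          "authorization": "<authorization code>",
          "sessionId": "<session id>",
          "subscriptionId": "<subscription id>",
          "resourceGroupName": "<resource group name>"
        }
      }
    }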

Data Sets – This is the same concept as Reporting Services reports or SSIS packages: we first choose a data source, which may be anything from a SQL Server database to an MS Access database. Similarly, here, once the source and destination linked services are selected, we need to define the data sets, which can be a file/folder or a table/collection.

In our current scenario, the files live in Azure Blob Storage and we have already created a linked service for that storage account. Therefore, the data set will also be an Azure Blob Storage data set:

[Image: Creating an Azure Blob Storage data set]

In this data set you specify the actual file name or folder you want to pick up, as in the sketch below.
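A minimal sketch of such a data set, again assuming the classic ADF JSON schema (the data set name, folder path and file name are made up for illustration; it points back to the storage linked service created earlier):

    {
      "name": "PerfmonInputBlob",
      "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "StorageLinkedService",
        "typeProperties": {
          "folderPath": "perfmon/input/",
          "fileName": "perfmon.log",
          "format": { "type": "TextFormat" }
        },
        "external": true,
        "availability": {
          "frequency": "Day",
          "interval": 1
        }
      }
    }

Here "external": true tells ADF that the file is produced outside the factory, and the availability section says a new slice of this data is expected once a day.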

Activities – Activities define the actions to be performed on the data. The relationship between all the components is as follows:

[Image: Relationship between linked services, data sets, activities, and pipelines]

Pipelines – Once the source and destination data stores are decided, the file to work on is selected, and the compute is chosen, it's time to create a pipeline with the set of activities for the transformation:

[Image: Creating a pipeline with activities]
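To give an idea of how the pieces come together, here is a rough sketch of a pipeline with a single U-SQL activity, based on the classic ADF JSON schema used in the U-SQL demo linked below (the pipeline name, script path, dates and the output data set PerfmonOutputBlob are illustrative; the output data set would be defined just like the input one):

    {
      "name": "ProcessPerfmonPipeline",
      "properties": {
        "description": "Run a U-SQL script against the perfmon file",
        "activities": [
          {
            "name": "ProcessPerfmonUSQL",
            "type": "DataLakeAnalyticsU-SQL",
            "linkedServiceName": "AzureDataLakeAnalyticsLinkedService",
            "typeProperties": {
              "scriptPath": "scripts\\ProcessPerfmon.usql",
              "scriptLinkedService": "StorageLinkedService",
              "degreeOfParallelism": 3,
              "priority": 100
            },
            "inputs": [ { "name": "PerfmonInputBlob" } ],
            "outputs": [ { "name": "PerfmonOutputBlob" } ],
            "policy": { "concurrency": 1, "retry": 3 },
            "scheduler": { "frequency": "Day", "interval": 1 }
          }
        ],
        "start": "2016-07-01T00:00:00Z",
        "end": "2016-07-02T00:00:00Z"
      }
    }

The scheduler section controls how often the activity runs, and the start/end dates define the period for which slices are produced.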

Data Gateways – If you want to move data back and forth between on-premises systems and Azure, you use a data gateway (the Data Management Gateway).

[Image: Data Management Gateway]
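The gateway itself is installed on the on-premises machine and registered with the factory; a linked service then refers to it by name. A rough sketch of an on-premises SQL Server linked service in the classic ADF JSON schema (the server, database and gateway names are placeholders):

    {
      "name": "OnPremSqlServerLinkedService",
      "properties": {
        "type": "OnPremisesSqlServer",
        "typeProperties": {
          "connectionString": "Data Source=<servername>;Initial Catalog=<databasename>;Integrated Security=True;",
          "gatewayName": "<gateway name>"
        }
      }
    }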

Links to demos which you can try yourself:
https://azure.microsoft.com/en-us/documentation/articles/data-factory-usql-activity/
https://azure.microsoft.com/en-us/blog/azure-data-factory-updates-execute-adf-custom-net-activities-using-azure-batch/

References – https://azure.microsoft.com/en-us/documentation/articles/data-factory-introduction/


Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 1)

In the previous posts, I wrote about Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). To put it simply, ADLS works like a SAN and ADLA works like compute (a server with RAM and CPUs); together they make a great server machine. Now, what is Azure Data Factory?

Azure Data Factory (ADF) is a framework used for data transformation on Azure. Just as we have SSIS for on-premises SQL Server, ADF is primarily the transformation service for the Azure data platform services. Let's take the same perfmon analysis example: we need to process the perfmon logs of 500,000 machines on a daily basis.

1. The data has to be ingested into the system – Azure Data Lake Store
2. The data has to be cleaned – Azure Data Lake Analytics / Hive queries / Pig queries
3. The data has to be transformed for reporting/aggregation – Azure Data Lake Analytics / Hive queries / Pig queries / Machine Learning model
4. The data has to be inserted into a destination for reporting – Azure SQL DW or Azure SQL DB

How do we run all these steps in sequence and at regular intervals? ADF is the solution for all of that. To use ADF, we first need to create an ADF account:

[Image: Creating an Azure Data Factory account in the portal]

Once you click Create, you will see this screen:

[Image: Azure Data Factory dashboard]

This is the dashboard we will use to create transformations with ADF. In the next post, I will write about how to create a pipeline for a transformation.

HTH!