Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 1)

In the previous posts, I wrote about Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). To put it simply, ADLS works like a SAN and ADLA works like compute (a server with RAM and CPUs); together, they make a great server machine. Now, what is Azure Data Factory?

Azure Data Factory (ADF) is a service used for data movement and transformation on Azure. Just as SSIS is the transformation service for on-premises SQL Server, ADF plays that role primarily for the Azure data platform services. Let's continue with the same perfmon analysis example: we need to process the perfmon logs of 500,000 machines on a daily basis.

1. The data has to be ingested into the system – Azure Data Lake Store
2. The data has to be cleaned – Azure Data Lake Analytics / Hive queries / Pig queries
3. The data has to be transformed for reporting/aggregation – Azure Data Lake Analytics / Hive queries / Pig queries / Machine Learning model
4. The data has to be loaded into a destination for reporting – Azure SQL DW or Azure SQL DB

How do we run all these steps in sequence and at regular intervals? ADF is the solution for all of that. To use ADF, we first need to create a Data Factory:
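The four steps above map naturally onto a single ADF pipeline. As a rough sketch, here is a hypothetical (v1-style) pipeline definition built as a Python dict – all activity and dataset names are placeholders, though `Copy` and `DataLakeAnalyticsU-SQL` are real ADF activity types:

```python
def perfmon_pipeline():
    """Hypothetical ADF pipeline chaining the four perfmon steps."""
    return {
        "name": "PerfmonDailyPipeline",
        "properties": {
            "activities": [
                {   # Step 1: ingest raw perfmon logs into ADLS
                    "name": "IngestLogs",
                    "type": "Copy",
                    "inputs": [{"name": "OnPremLogs"}],
                    "outputs": [{"name": "RawLogsADLS"}],
                },
                {   # Steps 2-3: clean + transform with a U-SQL script in ADLA
                    "name": "CleanAndAggregate",
                    "type": "DataLakeAnalyticsU-SQL",
                    "inputs": [{"name": "RawLogsADLS"}],
                    "outputs": [{"name": "AggregatedADLS"}],
                },
                {   # Step 4: load aggregates into SQL DW for reporting
                    "name": "LoadToWarehouse",
                    "type": "Copy",
                    "inputs": [{"name": "AggregatedADLS"}],
                    "outputs": [{"name": "ReportingTable"}],
                },
            ],
            # Daily scheduling is driven by this active window
            "start": "2017-01-01T00:00:00Z",
            "end": "2017-12-31T00:00:00Z",
        },
    }
```

Each activity's output dataset feeds the next activity's input, which is how ADF infers the execution order.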



Once you click Create, you will see this screen:




This is the dashboard we will use to build transformations with ADF. In the next post, I will write about how to create a pipeline for transformation.


Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Store

Let's start with the advanced storage we have on Microsoft Azure. We now have two options for storage: 1. Blob Storage 2. Azure Data Lake Store (ADLS). ADLS is more optimized for analytics workloads; therefore, when it comes to Big Data/advanced analytics, ADLS should be the first choice. Moreover, when we talk about Big Data, one must understand the concepts of HDFS (Hadoop Distributed File System) and MapReduce. For more information, please check – Video
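To illustrate the MapReduce concept without a cluster, here is a toy word count in plain Python – the map phase emits (key, 1) pairs and the reduce phase groups by key and sums, which is the same shape a Hadoop job has at scale:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

logs = ["CPU high CPU low", "disk high"]
print(reduce_phase(map_phase(logs)))
# {'cpu': 2, 'high': 2, 'low': 1, 'disk': 1}
```

On a real cluster, the map and reduce phases run in parallel across many nodes, with HDFS holding the input splits close to the compute.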

Before we get into Azure Data Lake Store, it's really important to understand that Azure Data Lake is a fully managed Big Data service from Microsoft. It consists of three major components:

1. HDInsight (Big Data cluster as a service) – it further offers five types of clusters, and we have the option to create any of these five cluster types as per our needs.

2. Azure Data Lake Store (hyper-scale storage optimized for analytics)
3. Azure Data Lake Analytics (Big Data queries as a service)

ADLS is the HDFS for Big Data analytics on Azure. Its major benefits are:
1. No limits on file size – individual files can be petabytes in size
2. Optimized for analytics workloads
3. Integration with all major Big Data players
4. Fully managed and supported by Microsoft
5. Can store data in any file format
6. Enterprise-ready, with features like access control and encryption at rest
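Because ADLS is a hierarchical, HDFS-style file system, a common pattern for a scenario like the 500,000-machine perfmon example is to partition raw logs into date folders, so downstream jobs only scan one day's folder. A minimal sketch – the folder layout and file names are hypothetical:

```python
from datetime import date

def perfmon_path(machine_name, day):
    """Build a date-partitioned ADLS path for one machine's daily log.

    The /perfmon/raw/YYYY/MM/DD layout is an illustrative convention,
    not anything ADLS mandates.
    """
    return "/perfmon/raw/{:%Y/%m/%d}/{}.blg".format(day, machine_name)

print(perfmon_path("web-042", date(2017, 1, 15)))
# /perfmon/raw/2017/01/15/web-042.blg
```

With this layout, a U-SQL or Hive job for a given day can point at a single folder instead of filtering the whole store.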

It’s really simple to create an Azure Data Lake Store Account:

Step 1: Search for the Azure Data Lake Store service on the portal

Step 2: Enter the service name and resource group name, and choose the appropriate location. Currently, the service is in preview, so the choice of data center locations is limited.


Step 3: Use the Data Viewer to upload and download the data – if the size of the data is small.


For larger volumes, however, you have options to upload/download data to/from the ADL Store using various tools, such as AdlCopy or an Azure Data Factory Copy data pipeline. As shown in the picture above, you can easily monitor the number of requests and the data ingress/egress rates from the portal itself. In the next blog post, we will talk about leveraging the ADL Store for ADL Analytics and Azure Data Factory.
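Another option is to script uploads with the `azure-datalake-store` Python package. A hedged sketch – the store name and service principal credentials below are placeholders, and this assumes the principal has been granted access to the account:

```python
def upload_perfmon_logs(store_name, tenant_id, client_id, client_secret,
                        local_dir, remote_dir):
    """Upload a local folder tree to ADLS in parallel.

    All parameter values are hypothetical placeholders; requires
    `pip install azure-datalake-store`.
    """
    # Imported lazily so the function can be defined without the package.
    from azure.datalake.store import core, lib, multithread

    # Authenticate with a service principal (tenant + app id + secret).
    token = lib.auth(tenant_id=tenant_id,
                     client_id=client_id,
                     client_secret=client_secret)
    adls = core.AzureDLFileSystem(token, store_name=store_name)

    # Multi-threaded, resumable upload of the whole directory tree.
    multithread.ADLUploader(adls, rpath=remote_dir, lpath=local_dir,
                            nthreads=8, overwrite=True)
```

A call might look like `upload_perfmon_logs("mystore", "<tenant-id>", "<app-id>", "<secret>", "./perflogs", "/perfmon/raw")`, where every argument is a placeholder for your own values.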