Summary of the learning in the year 2016-2017

It’s been really long since I wrote any post. I wanted to write this post to talk about how the last year has been and how I have been trying to evolve in the data platform technologies. Last year , I got an opportunity to peep outside RDMS world and I found to be a whole new dimension of data. I remember, I attended one of the sessions in Year 2015 which amazed with the power of data. This session was about Data Platform technologies where I saw this:


This is the list of different data platform companies which are working to create products in various categories like Relational/NoSQL/Caching etc. It’s a long list Smile 

I had heard a lot about NoSQL and Big Data but I never got a chance to learn or get into all this. But last year, I decided to learn what it’s all about. I started learning NOSQL and I even wrote a series of posts on the concepts of NOSQL from a DBA perspective. I went in depth of documentStore like MongoDB and DocumentDB and organized an event on NoSQL for the community. My intent has always been to share what’s new in the market and help community evolve as per the latest trend.

For the last almost eleven years of my career, I have immersed myself in technical learning and I wanted to learn about some bit of architecture and planning. I wanted to learn

1. What exactly senior leadership team does
2. How the business planning is done
3. How the business solutions and technical solutions are made to solve a business problem
4. How the architectures are created

In summary, how the entire business functions and what are the various types of profiles involved to solve a business problem. I got an opportunity to do TOGAF certification and I grabbed it whole heartedly. It was a tough challenge to align myself with the business processes to clear this exam. However, it was quite interesting to know the various types of architectures that are created to solve a business problem. It was fabulous experience to learn about TOGAF and getting certification cleared in the first attempt. I wrote a small blog post on this as well.

Later, I moved to Big Data where I thoroughly learnt about Entire Big Data stack on Microsoft Azure Platform i.e. Cortana Analytics Suite. This completely changed my point of view of looking at data and I was blown away with the power of Big Data.


One great advantage of this era of cloud computing is, every technology or product is getting integrated at one single location. For past couple of years, I have been working on SQL Server on Azure and on Premise and It was the perfect time to extend my learning to Big Data on Azure. I learnt about Cortana Analytics Suite and also did some customer projects as well. Moreover, We organized two events for the SQL Server Community in North India region and wrote some posts to share the understanding. 

Another interesting subject Machine Learning , Oh boy! it’s great to understand the power of machine learning and the kind of impact it could bring to the various business problems. Actually, it’s very difficult to master these technologies but understanding where these technologies fit in the solutions is really interesting.  Fortunately, I got opportunities with the customers for POC and Education deliveries. It further helped me to delve deeper in these technologies. To enhance my knowledge further on the big data solutions, I started preparing for 70-475 Designing and Implementing Big Data Analytics Solutions. It covers:

1. Stream Analytics
2.  Azure Data Factory
3. Machine Learning
4. HDInsight
5. Azure Data Lake Store
6. Storm/Spark
7. R

The exam was really tough and had to lots of hands on during preparation. However, cleared it in the first attempt and got lots of learning points further on. I wanted to write this post to share my non-RDBMS learning in the recent past. It may help the data professionals to find a new S curve and learn something new.

I am going to move to a new position in Microsoft however, the passion for writing and community work will still be on.  Soon, there is a plan to have a community meet for SQL on Linux and the new features of SQL Server 2017.

Stay Tuned for more updates!

Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Build HDInsight Cluster

We have discussed about the following components so far:

  • Azure Data Factory  -> It’s a transformation service for the Big Data Stack on Azure
  • Azure Data Lake Store –> It’s a storage to store any types of any file formats on Azure
  • Azure Data Lake Analytics –> It’s a compute which process the data on Azure

Now, let’s talk about next managed service know as Big Data Cluster as a Service. For ADLA, we just write the query , select the parallelism and execute the query – without worrying about what’s happening underneath. For HDInsight cluster, you will get virtual machines for RDP where you could write your hive or pig queries and manage within preselected sources.

Let’s see how to create this cluster:



Just write HDinsight in the search textbox and option to create HDInsight cluster will appear.


Select the option Hadoop from the drop down and choose appropriate options. One of the main screens is:


selecting the number of worker nodes. Based on the compute requirement, you could select worker nodes. Once rest of the inputs are completed, click create and it’ll roughly take around 20 minutes to setup your cluster.  you could further check the progress by checking the below options from the portal:




Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 2)

In the previous post, I wrote about creating ADF account to do the transformations. Now, let’s learn about how to create pipelines with ADF. I will explain the functionality of the ADF in this post. Moreover, I will share a link where you can find the demo to create a pipeline.

There are four major components of ADF:

1. Linked Services
2. Data Sets
3. Activities
4. Pipelines
5. Data Gateways


Linked Services : – Linked Services are used to link resources to the ADF e.g. if we need to process perfmon file, we first need to create linked services for the storage where it will pick the file from. Another Linked service for the compute which will process the file and so on.

Storage : – As shown below, there are various options to pick the files from: We need to select the source of the data and destination of the data. e.g. if I need to pick the data from Azure Storage where the files are kept, I will pick Azure storage for the data store. Moreover, if the transformed data has to be stored in SQL server or DocumentDB – I will pick the respective options from the same list:


Compute- Once Source and destination data store is selected , we need to select the compute now. For compute, we need to select:


In this example, I want to process the file using ADLA – USQL Queries so, I will choose Azure Data Lake Analytics. However, if I had to process the file using hive or pig queries – I would have chosen HDInsight or On demand HDInsight cluster.

Data Sets – It’s the same concept like Reporting Service reports/SSIS package. We first choose the data source which may anything like SQL Server Database or MS Access DB. Similarly,here once the source and destination is selected, we need to choose the data set which can be file/folder or table/collection.

In out current scenario, the files exist on Azure Blog storage and we have already created a linked service for the Blob storage. Therefore, the data set will also be for Azure Blob Storage:


In this Data Set you will mention the actual file name or folder you want to pick.

Activities – Activities define actions to be performed on the data.  The relationship between all the components is as follows


Pipeline- Once the source and destination of the file is decided, file to work on is selected and compute is decided. It’s the time to create a pipeline with the set of activities for the transformation:

Data GatewayIf you want to move data to and fro on-premise to Azure, then you use Data gateway.


Link to demo which you can try yourself –

References –

Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Factory (Part 1)

In the previous posts, I wrote about Azure Data Lake Store (ADLS) and Azure Data Lake Analytics (ADLA). To make you understand better, let’s say ADLS works as a SAN and ADLA works as compute (server with RAM and CPUs). They make a great server machine together. Now, what is Azure Data Factory ?

Azure Data Factory(ADF)  is a framework that’s used for data transformation on Azure. Like, we have SSIS service for On-premise SQL Server similarly, ADF is a transformation service for Azure data platform services – primarily. Let’s take the same example for perfmon analysis, we need to process the perfmon logs for 500,000 machines on daily basis.

1. The data has to be ingested into the system  – Azure Data Lake Store
2. The data has to be cleaned  – Azure Data Lake Analytics / Hive Queries / Pig Queries
3. The data has to be transformed for the reporting/aggregation – Azure Data Lake Analytics / Hive Queries / Pig Queries/ Machine Learning model
4.  The data has to be inserted to a destination for reporting – Azure SQL DW or SQL Azure DB

How to run all these steps in sequence and on regular intervals? ADF is the solution for all that. To use ADF, we need to create ADF account first:



Once you click create, you will this screen:




This is the dashboard which we will use to create the transformation using ADF.  In the next post, I will write about how to create pipeline for transformation.


Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Analytics (Part 2)

After learning how to create Azure Data Lake Analytics (ADLA) account. It’s the time to write some queries to leverage this account. As we know, U-SQL is a query language for this platform. The best thing about Microsoft Big Data Stack on Azure is, the query languages are SQL like and are really easy to understand.

Let’s see how to leverage ADLA and write U-SQL queries. There are options like:

1. Submit the job directly from the portal –


Major Parameters are as follows:

1. Job Name –> Name of the Job
2. Parallelism –> Maximum number of compute processes that can happen at the same time
3. Priority –> Lesser the value , higher the priority is. Job with higher priority will run first.
4. Query Editor –> Write your U-SQL Queries

2. Other tools like:

Download the add-ins as per the preference. Another example, I will show you is from Visual Studio. After installing the tools for visual studio, the options will look like:



Just select the first option and the interface will be like this:


On the left hand side, you are seeing the Azure Analytics account, underneath that we have ADLA database (Master as a default). It gives a feel of SQL Server DBs underneath which we can have procedures/tables/views etc.

In the middle, there is an option to select the database (a new DB can be created), schema of the object , ADLA account (Local is by default) and then Submit. Moreover, if you click the drop down underneath submit , you can even select the parallelism and priority.

On the right side, you can register/create assemblies for the programming purpose and then later use those in the U-SQL queries.

After submitting the job when it completes, the interface looks like:

As shown in the pic, you can see the status of job and see how long it took to run the entire job.

Moreover, to learn U-SQL, please follow –

Let’s finish the all the components of Cortana Analytics Suite. After that, we will pick up a real life scenario and explain how all these components fit together.


Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Analytics(Part 1)

As we have set the context right through previous posts, now it’s time to understand how Big Data Queries are written and executed. As we know, we can store the data on Azure data lake store and there will be a use case for that. Let’s take a very easy example of Perfmon data – e.g. I have written some queries to process the perfmon data on daily basis. Let’s say, we want to find out, how many servers out of 500,000, servers faced memory pressure. We have automated perfmon data collectors scheduled on all the systems and the logs need to be analyzed on the daily basis.


1. Perfmon data collector files in CSV format are saved on Azure data lake store
2. Need to process all the files to find out the servers which faced memory pressure

In this scenario, we have options like put the data inside SQL Server and then do the analysis on the top of it. Analyzing perfmon data for 500,000 server is going to need lots of compute on SQL server and it may cost really heavy for the hardware. Moreover, the query has to be run just once per day. Do you think, it’s wise purchase 128 core machine and with TBs of SAN to do this job? In such case, we have options to process the data using Big Data solutions.

Note – I have used this very simple example to help you understand the concepts. We will talk about real life case studies as we move forward. 

In this particular scenario, I have choices like:

1. Use Azure Data Lake Analytics
2. Use Azure Data Lake HDInsight Hive cluster

For this post, I will pick Azure Data Lake Analytics (ADLA). This particular Azure service is also known as Big Data Query as a Service. Let’s first see how to create ADLA:

Step 1

Step 2  Enter the Data Lake Store detail for the storage and other details


In above steps, we have create compute account i.e. Azure Data Lake Analytics account which will process the files for us. ( Analogically, one machine with set of processors/RAM(ADLA) and for storage we added ADL store to the account). In Azure, we have both storage and compute as different entities. It helps to scale either compute or storage independent of each other.

Step 3 – After clicking create, the dashboard will look like this:


Now, both the compute (to process the files) and storage (where the perfmon files are stored) is created. As this service is big data query as a service, we can just write big data queries which internally will be executed by Azure platform automatically. It’s a PaaS service like SQL Azure DB where you just write your queries without bothering about what machine is underneath or where the files are stored internally.

Analogically, it’s a broker for you who you hand over the files , give him the instructions  , instruct how many people should work on the task (for compute) and then he shares the results with you. This broker understand U-SQL as a language like T-SQL is for SQL Server. If you want to get your task done, you need to write U-SQL queries and submit to the ADLA. Based on the instructions and compute defined by you, it will return the results.
Let’s talk about framework to write U-SQL Queries in the upcoming posts.



Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Store

Let’s start with advanced storage which we have got on Microsoft Azure. Now we have two options for storage 1. Blob Storage 2. Azure Data Lake Storage(ADLS). ADLS is more optimized for analytics workload therefore, when it comes to Big Data/Advanced analytics ADLS should be the first choice. Moreover, when we talk about Big Data, one must understand the concepts of HDFS (Hadoop Distributed File System) and Map Reduce. For more information, please check – Video

Before we get into Azure Data lake Store, it’s really important to understand Azure Data Lake is a fully managed Big Data Service from Microsoft. It consists of three major components:

1. HDInsight (Big Data Cluster as a Service) (It further has 5 types of clusters)
We have an option create any of these 5 types of the cluster as per the needs.

2. Azure Data lake Store (Hyper Scale Storage optimized for analytics)
3. Azure Data Lake Analytics ( Big Data Queries as a Service)

ADLS is HDFS for Big Data Analytics on Azure. The major benefits, it serves are:
1. No Limits on the file size – maximum file size can be in PBs
2. Optimized for Analytics workload
3. Integration with all major Big Data Players
4. Fully managed and supported by Microsoft
5. Can store data of any file formats
6. Enterprise ready with the features like access control and encryption at rest

It’s really simple to create an Azure Data Lake Store Account:

Step 1:  Search for the Azure Data Lake Service on the Portal

Step 2:  Enter the Service Name , Resource Group name and choose the appropriate location. Currently, it’s under preview and there will be limited options on the location of the data centers.


Step 3 : Use Data Viewer to upload and download the data – if the size of the data is small.


However, you have options to upload the data to ADL Store using various tools like ADL Copy or Azure Data Factory Copy data Pipeline to upload/download the data from ADL store. As shown the above picture, you can easily monitor the number of requests and data ingress/egress rate from the portal itself.  In the next blog post, we will talk about leveraging ADL store for ADL analytics and Azure Data Factory.