Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Analytics (Part 1)

As we set the context in the previous posts, it's now time to understand how Big Data queries are written and executed. We know we can store data on Azure Data Lake Store, and there will be a use case behind doing so. Let's take a very simple example: Perfmon data. Say we want to find out how many servers, out of 500,000, faced memory pressure. Automated Perfmon data collectors are scheduled on all the systems, and the logs need to be analyzed on a daily basis.

Scenario:

1. Perfmon data collector files in CSV format are saved on Azure Data Lake Store
2. We need to process all the files to find out which servers faced memory pressure

In this scenario, one option is to put the data inside SQL Server and then do the analysis on top of it. However, analyzing Perfmon data for 500,000 servers is going to need lots of compute on SQL Server, and the hardware may cost a fortune. Moreover, the query has to run just once per day. Do you think it's wise to purchase a 128-core machine with terabytes of SAN storage for this job? In such cases, we can instead process the data using Big Data solutions.

Note – I have used this very simple example to help you understand the concepts. We will talk about real-life case studies as we move forward.

In this particular scenario, I have choices like:

1. Use Azure Data Lake Analytics
2. Use an Azure HDInsight Hive cluster

For this post, I will pick Azure Data Lake Analytics (ADLA). This particular Azure service is also known as Big Data Query as a Service. Let's first see how to create an ADLA account:

Step 1

[Image: creating a new Azure Data Lake Analytics account in the Azure portal]
Step 2 – Enter the Data Lake Store details for storage, along with the other account details

[Image: Data Lake Store and account details form]

In the above steps, we created a compute account, i.e. an Azure Data Lake Analytics account, which will process the files for us. (By analogy: ADLA is a machine with a set of processors and RAM, and for storage we attached the ADL Store to the account.) In Azure, storage and compute are separate entities, which lets us scale either one independently of the other.

Step 3 – After clicking Create, the dashboard will look like this:

[Image: Azure Data Lake Analytics account dashboard]

Now both the compute (to process the files) and the storage (where the Perfmon files are stored) are created. Since this service is Big Data Query as a Service, we can just write big data queries, which the Azure platform executes for us internally. It's a PaaS offering, like Azure SQL Database, where you simply write your queries without bothering about which machine sits underneath or where the files are stored internally.

By analogy, it's a broker: you hand over the files, give the instructions, specify how many people should work on the task (the compute), and it shares the results with you. This broker understands U-SQL as its language, much as SQL Server understands T-SQL. If you want to get your task done, you write U-SQL queries and submit them to ADLA. Based on the instructions and the compute you defined, it returns the results.

Let's talk about the framework for writing U-SQL queries in the upcoming posts.
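To give you a flavor of what such a U-SQL query might look like for our Perfmon scenario, here is a minimal sketch. The file paths, column names, and the 512 MB "memory pressure" threshold are my assumptions for illustration, not part of any real collector schema:

```sql
// Hypothetical sketch – paths, columns, and threshold are assumptions.
// Read all daily Perfmon CSVs; {SourceFile} is a virtual column bound to the file name.
@perfmon =
    EXTRACT ServerName      string,
            CounterTime     DateTime,
            AvailableMBytes int,
            SourceFile      string
    FROM "/perfmon/daily/{SourceFile}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// A server "faced memory pressure" if its available memory ever dropped below 512 MB.
@memoryPressure =
    SELECT ServerName,
           MIN(AvailableMBytes) AS MinAvailableMB
    FROM @perfmon
    GROUP BY ServerName
    HAVING MIN(AvailableMBytes) < 512;

// Write the result back to the Data Lake Store as a CSV with a header row.
OUTPUT @memoryPressure
TO "/output/memory-pressure-servers.csv"
USING Outputters.Csv(outputHeader: true);
```

You would submit this script as a job to the ADLA account, choose the degree of parallelism (the "how many people should work on it" knob), and collect the output file once the job completes.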

 

HTH!


Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Cortana Analytics Suite (Intro.)

One stream that is in really huge demand today is Big Data. With cloud platforms, it has gained even more power and visibility. Earlier, Big Data work was done in silos, but ever since cloud computing came into rhythm, this technology has gained even more traction in the world of technology. Everything today is getting integrated on the cloud.

As a relational guy, it's been a great journey learning this technology. It has been really easy to grasp things like HDInsight, Azure Data Lake Analytics, or even Stream Analytics for that matter, since all these technologies are based on SQL-like query languages, .NET, PowerShell, etc. At the same time, they integrate really well with open-source technologies (Linux/NoSQL) and other languages like Java.

All the posts I have written so far have been from the point of view of a relational guy, i.e. how you, as a DBA or database developer, can pursue these technologies. This time, I am going to write a series on the Big Data stack on Microsoft Azure, aka the Cortana Analytics Suite.

I am going to write a post on everything that appears in the diagram below. Moreover, we will deep dive into a few technologies, such as Machine Learning and Azure Data Lake.

[Image: Cortana Analytics Suite component diagram]

HTH!