Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Analytics (Part 1)

Now that the previous posts have set the context, it’s time to understand how Big Data queries are written and executed. We know we can store data on Azure Data Lake Store, and we need a use case for it. Let’s take a very easy example: perfmon data that has to be processed on a daily basis. Say we want to find out how many servers, out of 500,000, faced memory pressure. We have automated perfmon data collectors scheduled on all the systems, and the logs need to be analyzed every day.


1. Perfmon data collector files in CSV format are saved on Azure Data Lake Store
2. We need to process all the files to find the servers which faced memory pressure

In this scenario, one option is to load the data into SQL Server and do the analysis on top of it. Analyzing perfmon data for 500,000 servers is going to need a lot of compute on SQL Server, and the hardware may cost a lot. Moreover, the query has to run just once per day. Do you think it’s wise to purchase a 128-core machine with TBs of SAN to do this job? In such a case, we have the option to process the data using Big Data solutions.

Note – I have used this very simple example to help you understand the concepts. We will talk about real life case studies as we move forward. 
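To make the example concrete, here is a minimal local sketch in Python of the question we are asking of the data. It is not how ADLA runs the job. The CSV layout (server, timestamp, available_mbytes), the sample rows and the 100 MB threshold are all assumptions made for illustration; real perfmon exports may look different.

```python
import csv
import io

# Hypothetical layout: one row per sample, with the server name and the
# "Available MBytes" counter. Real perfmon exports may look different.
SAMPLE_CSV = """server,timestamp,available_mbytes
srv-001,2016-06-01T00:00:00,2048
srv-001,2016-06-01T00:05:00,1900
srv-002,2016-06-01T00:00:00,80
srv-002,2016-06-01T00:05:00,60
srv-003,2016-06-01T00:00:00,512
srv-003,2016-06-01T00:05:00,90
"""

# Assumed threshold: a sample below this many free MB counts as pressure.
LOW_MEMORY_MB = 100


def servers_under_memory_pressure(csv_file, threshold_mb=LOW_MEMORY_MB):
    """Return the sorted names of servers with at least one low-memory sample."""
    flagged = set()
    for row in csv.DictReader(csv_file):
        if float(row["available_mbytes"]) < threshold_mb:
            flagged.add(row["server"])
    return sorted(flagged)


pressured = servers_under_memory_pressure(io.StringIO(SAMPLE_CSV))
print(len(pressured), pressured)
```

On one small file this is trivial. The challenge in our scenario is running the same logic over the files of 500,000 servers every day, which is where a Big Data service comes in.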

In this particular scenario, I have choices like:

1. Use Azure Data Lake Analytics
2. Use Azure Data Lake HDInsight Hive cluster

For this post, I will pick Azure Data Lake Analytics (ADLA). This Azure service is also known as Big Data Query as a Service. Let’s first see how to create an ADLA account:

Step 1

Step 2 – Enter the Data Lake Store details for the storage, along with the other details


In the above steps, we have created the compute account, i.e. the Azure Data Lake Analytics account, which will process the files for us. (By analogy, ADLA is one machine with a set of processors and RAM, and for storage we added the ADL Store to the account.) In Azure, storage and compute are separate entities, which helps to scale either one independently of the other.

Step 3 – After clicking create, the dashboard will look like this:


Now, both the compute (to process the files) and the storage (where the perfmon files are stored) are created. As this is Big Data query as a service, we can just write Big Data queries, and the Azure platform will execute them internally and automatically. It’s a PaaS service like SQL Azure DB, where you just write your queries without bothering about what machine is underneath or where the files are stored internally.

By analogy, it’s a broker: you hand over the files, give him the instructions, tell him how many people should work on the task (the compute), and then he shares the results with you. This broker understands U-SQL as its language, just as SQL Server understands T-SQL. If you want to get your task done, you need to write U-SQL queries and submit them to ADLA. Based on the instructions and the compute you define, it will return the results.
Let’s talk about the framework for writing U-SQL queries in the upcoming posts.




Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Azure Data Lake Store

Let’s start with the advanced storage we have on Microsoft Azure. We now have two options for storage: 1. Blob Storage 2. Azure Data Lake Storage (ADLS). ADLS is more optimized for analytics workloads; therefore, when it comes to Big Data and advanced analytics, ADLS should be the first choice. Moreover, when we talk about Big Data, one must understand the concepts of HDFS (Hadoop Distributed File System) and MapReduce. For more information, please check – Video

Before we get into Azure Data Lake Store, it’s really important to understand that Azure Data Lake is a fully managed Big Data service from Microsoft. It consists of three major components:

1. HDInsight (Big Data Cluster as a Service), which further has 5 types of clusters.
We have the option to create any of these 5 types of cluster as per our needs.

2. Azure Data Lake Store (hyper-scale storage optimized for analytics)
3. Azure Data Lake Analytics (Big Data Queries as a Service)

ADLS is HDFS for Big Data analytics on Azure. The major benefits it offers are:
1. No limits on the file size – the maximum file size can be in PBs
2. Optimized for analytics workloads
3. Integration with all major Big Data players
4. Fully managed and supported by Microsoft
5. Can store data of any file format
6. Enterprise ready, with features like access control and encryption at rest

It’s really simple to create an Azure Data Lake Store Account:

Step 1:  Search for the Azure Data Lake Service on the Portal

Step 2:  Enter the service name and Resource Group name, and choose the appropriate location. Currently, the service is in preview, so there are limited options for the location of the data centers.


Step 3:  Use Data Viewer to upload and download the data, if the size of the data is small.


For larger data, you can upload to and download from ADL Store using tools like ADL Copy or an Azure Data Factory copy data pipeline. As shown in the above picture, you can easily monitor the number of requests and the data ingress/egress rate from the portal itself. In the next blog post, we will talk about leveraging ADL Store for ADL Analytics and Azure Data Factory.



Build Smart Solutions using Big Data Stack on Microsoft Azure Platform – Cortana Analytics Suite (Intro.)

One stream which is in really huge demand today is Big Data. With cloud platforms, it has gained even more power and visibility. Earlier, Big Data was being done in silos, but ever since cloud computing hit its stride, this technology has gained even more traction. Everything today is getting integrated on the cloud.

As a relational guy, it’s been a great journey to learn this technology. It’s been really easy to grasp things like HDInsight, Azure Data Lake Analytics or even Stream Analytics, because all these technologies are based on a SQL-like query language, .NET, PowerShell etc. On top of that, there is really great integration with open source technologies (Linux / NoSQL) and other languages like Java.

All the posts I have written so far have been from the point of view of a relational guy, i.e. how you can pursue these technologies as a DBA or database developer. This time, I am going to write a series on the Big Data stack on Microsoft Azure, aka Cortana Analytics Suite.

I am going to write a post on everything that appears in the below diagram. Moreover, we will deep dive into a few technologies like Machine Learning and Azure Data Lake.



Tips to prepare for TOGAF exam

After clearing my exam, I have been getting lots of queries on how to prepare for it. I wanted to share my experience with the community so that it can help other folks with their prep. I took this exam in June 2016. There are multiple options to take this exam:

1. TOGAF foundation
2. TOGAF Certified
3. TOGAF Foundation & Certified combined
4. TOGAF Bridge exam (TOGAF 8 to TOGAF 9 upgrade)

I took the third option, the TOGAF Foundation and Certified combined exam. There are various ways to prepare: one is to download the book from The Open Group website, and the other is to go for instructor-led training. I am not sure whether people have cleared the exam without any expert advice, based just on their prior experience. I cleared the exam after taking training from Simplilearn. I really liked their trainers; they were experienced professionals and very knowledgeable about the subject.

Training and self-study material:
I especially liked the way Satish Byali trained us by sharing real-life experiences. The following are the great resources he shared, all available for free:

1. He shared some demos of tools like Abacus, which is used to create architecture based on TOGAF standards.
2. Various great examples like E-Pragati, an initiative by the Andhra Pradesh government which uses TOGAF architecture standards. Some resources worth a look:

3. Some other good resources:

4. Most important one: posters for each phase. You need to remember every bit of these. The posters cover the inputs and outputs of each phase and the various deliverables, which are must-knows for the exam.

Reading from the book is one thing, but it helps to know what really happens at the ground level. For example, the Abacus tool shows all the artifacts created at each phase. Seeing what they look like in reality and how they are created helps you absorb these concepts and relate them to each other. You can download the trial version of Abacus and play around with it for one month.

Taking simulation tests and tips from internet search
The Simplilearn website gives you two simulation tests each for the Foundation and Certified exams, plus 80 bonus questions for extra practice. I took those tests 3-4 times each. I also searched the internet for more exam tips, and here are the ones which really helped:


Must watch Video series –

On the exam day:
Finally, after two months of preparation, the exam day came. The first tip I will give is to be confident and not panic. My exam was at 12:30 PM and the test center was 60 km away from my home. I had done all the research on how to reach the center one day before the exam. That helped to avoid unnecessary stress, and I reached the exam center by 12:00 PM, as planned. Things went really smoothly and I was confident about taking the exam.

At the center –
1. The staff asked for two ID proofs as per the process, so please carry two. BTW, I wasn’t aware of that.
2. They gave me a locker for keeping all my stuff. Nothing except the two ID proofs was allowed inside the test center.
3. They do a security check before taking you into the test center.
4. You are allowed to take breaks in between, but the test clock keeps running. Also, they will do a security check again when you return.

Foundation Test:
Finally, the time came when I had to click the right options. There is no negative marking, so you should attempt all 40 questions. The passing score for this exam is 22, and I scored 31, which was above my expectations.

1. I finished all the questions within 30 minutes and marked the ones where I was not sure of the answer.
2. I reviewed all the questions once again, and you must do this too, because sometimes you will find the answer to one question in a later question.
3. I reviewed all the marked questions again in a second round.

I finished the test 5 minutes prior to the completion time. I was moved to the certification test straight away:

Certified Test:  As soon as I clicked Finish for the foundation exam, I was at the first question of the certification exam. You have to go through long paragraphs to understand the scenario, and then there are questions with 4 options to select from.

1. The best tip I got was to select the answer which is really TOGAFish. You will see that the answers talk about deliverables and artifacts.
2. This is an open book exam, so I checked whether the artifacts and deliverables mentioned in the answers were correct.
3. The book is really helpful; you can search the content and text of its chapters for anything.
4. If you practice well in the demo tests, you will get an idea of the types of questions to expect, which is really helpful in the exam.
5. I finished my exam in 1 hour, and for the next 30 minutes I reviewed all the questions until the time was over.

When I clicked Finish, my result flashed in front of me, and it was a success. After clearing the exam, you will get an email from The Open Group. Use the instructions in the email to retrieve your certificate.

I hope reading these tips will help you and I will again say – don’t panic and just go for it. All the Best!

Embrace NoSQL as a relational Guy – Column Family Store

It’s been really long since I wrote a post. It’s been a really busy month; otherwise, it’s really difficult for me to stay away from writing. This is the final post in this series about NoSQL technologies.

My intent for this series of posts is to cover a breadth of technologies to help readers understand the bigger picture. It feels like a revolution, with the technology growing at a massive scale. Everyone should have at least a basic understanding of technologies like Cloud Computing, NoSQL, Big Data and Machine Learning.

Okay, let’s talk about the Column family store, aka Columnar family. The best way to explain it is with the example of Columnstore Indexes, introduced in SQL Server 2012. In traditional SQL Server tables, the data is stored in the form of rows:


Referring to the above picture: in RDBMS systems, the data is stored on the page in units of complete rows. Even if we select a single column from the table, the entire row has to be brought into memory.

For business intelligence reports, we generally rely on aggregations like sum, avg, max and count. Aggregating a single column of a table with 1 billion rows means scanning the entire table to process the query. On the other hand, if the data is stored in the form of individual columns, the aggregation needs far fewer I/Os, because only that column is read. Moreover, columnar databases offer huge compression ratios, which can shrink 1 TB of data to a few GBs.
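The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration with made-up order data, not how SQL Server or any columnar database stores pages internally. The run-length encoding at the end is one common way repeated values in a column can be compressed, shown here as an assumption about why columns tend to compress well.

```python
# Row-oriented layout: each record keeps all of its columns together.
rows = [
    {"order_id": 1, "customer": "A", "country": "IN", "amount": 120.0},
    {"order_id": 2, "customer": "B", "country": "IN", "amount": 80.0},
    {"order_id": 3, "customer": "C", "country": "IN", "amount": 50.0},
    {"order_id": 4, "customer": "D", "country": "US", "amount": 250.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# Row layout: every whole record is visited although only "amount" is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

country_encoded = run_length_encode(columns["country"])
print(total_from_rows, total_from_columns, country_encoded)
```

Both totals are the same, but the column layout touches one list instead of every record, and the country column collapses to two pairs.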

This database system is specifically designed to cater to aggregations and BI reporting. For example, if you want to find the average hits for each website on the web from a table that is terabytes in size, it’s probably going to take days. With a columnar database, the benefits we get are:

1. The data is picked based on the columns selected in the query instead of the entire row. Mostly, these aggregations go for scans, and scanning TBs of data is going to take very long. By leveraging columnar databases, the I/O will drastically reduce.

2. Using columnar databases, e.g. HBase, we can leverage distributed processing to fetch the result faster. As we know, these NoSQL technologies scale out really well and can leverage the power of distributed queries across multiple machines. Data which takes days to process can be processed within minutes or seconds.
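The distributed part of the second point can be sketched as follows. Each list stands in for the rows held by one machine; every machine computes a partial sum and count per website, and the partials are then merged into averages. The partitions and site names are hypothetical, and this plain Python sketch does not use HBase or any real cluster API.

```python
from collections import defaultdict

# Hypothetical partitions: each list stands for the rows held by one machine.
partitions = [
    [("site-a", 10), ("site-b", 4), ("site-a", 30)],
    [("site-b", 6), ("site-c", 8)],
    [("site-a", 20), ("site-c", 12)],
]


def partial_aggregate(rows):
    """Per-machine step: total hits and row count for each site."""
    totals = defaultdict(lambda: [0, 0])
    for site, hits in rows:
        totals[site][0] += hits
        totals[site][1] += 1
    return dict(totals)


def merge(partials):
    """Combine the per-machine results and compute the average per site."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for site, (total, count) in partial.items():
            merged[site][0] += total
            merged[site][1] += count
    return {site: total / count for site, (total, count) in merged.items()}


averages = merge(partial_aggregate(rows) for rows in partitions)
print(averages)
```

Sums and counts are merged rather than per-machine averages, because an average of averages would be wrong when machines hold different numbers of rows.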

Major known players for columnar databases are: HBase, Cassandra, Amazon’s DynamoDB and Google’s Bigtable.

Embrace NoSQL as a relational Guy – Key Value Pair

There are two major types of Key Value pair DBs: 1. Persisted 2. Non-persisted (cache based). This is a very popular type of NoSQL database, e.g.

Persisted –> Azure Tables, Amazon Dynamo, Couchbase etc.

Non-persisted –> Redis Cache, Memcached etc. (the main purpose is caching for websites)

The data in these databases is stored in the form of a key and a value:

The data is accessed based on the key, and the value can be JSON, XML, an image or anything else which fits in blob storage, i.e. the value is stored in the form of a blob. Like other NoSQL DBs, they are not schema bound. For an ecommerce website, if we want to store the information about a customer’s shopping, we can use the customer id as the key and all the shopping information as the value. Since all the required data is stored in a single unit in the form of a value, it can be scaled really efficiently.
Another use case for a key value store is storing session information, e.g. in a game where millions of users are active online, their profile information can be stored as key value pairs. These databases can handle massive scale easily and can also provide redundancy to avoid loss of data. Moreover, the many applications which just store information in the form of images can leverage a key value pair database easily.
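The ecommerce example can be sketched with a toy in-memory store in Python. The `KeyValueStore` class, the customer id and the cart contents are all hypothetical and only illustrate the access pattern; a real store such as Azure Tables or Redis has its own client API.

```python
import json


class KeyValueStore:
    """A toy in-memory store: opaque byte values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store the value (bytes) under the key, replacing any old value."""
        self._data[key] = value

    def get(self, key):
        """Return the stored bytes, or None when the key is absent."""
        return self._data.get(key)


store = KeyValueStore()

# Key = customer id, value = all the shopping information as one JSON blob.
cart = {
    "customer_id": "C1001",
    "items": [{"sku": "BOOK-1", "qty": 2}, {"sku": "PEN-7", "qty": 10}],
    "total": 598.0,
}
store.put("C1001", json.dumps(cart).encode("utf-8"))

# The store never looks inside the value; the application decodes it.
loaded = json.loads(store.get("C1001").decode("utf-8"))
print(loaded["total"], len(loaded["items"]))
```

Because every lookup needs only the key, the keys can be spread across many machines, which is one reason this model scales out so easily.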

As an RDBMS guy, it’s a little difficult to relate to these databases, but just try to understand the context for now. We will discover more about these databases as we go along. In summary, this is another factor which influences our decision to choose a NoSQL DB (Tables refers to Azure Tables, a key value pair DB):




Embrace NoSQL technologies as a relational Guy – Graph DB

It’s a fact that NoSQL technologies are growing at a rapid pace. I even heard someone say that NoSQL is old now and NewSQL is the new trend. NewSQL aims to give the performance of NoSQL while following the ACID principles of RDBMS systems. Anyway, let’s focus on Graph DB for now.

Graph databases specialize in dealing with relationship data. They are good at finding patterns and relationships between events, employees of an organization or operations. They can help make an application more interactive by suggesting more options based on previous patterns of browsing or shopping.

Have you noticed:

1. Facebook offers suggested friends.

2. LinkedIn offers suggested connections, or connections from the same organization. That’s the use of Graph Databases.

3. Flipkart/Amazon offer “people also viewed” options (real-time recommendations), which help you purchase more products.

4. Master data management, where a knowledge base article can be suggested based on the support case for faster resolution.

5. Dependency analysis for shutting down an IT operation, i.e. finding the users who may be impacted if a router is shut down for maintenance. It can help to send advance notification to those employees.

Graph DBs are being used for all of the above scenarios. Just check the below picture to see what relationships look like:


Neo4j is one of the leading Graph DBs today. The query language used with Neo4j is Cypher. It’s used widely by major tech, healthcare and manufacturing companies.
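To get a feel for the “suggested friends” scenario, here is a minimal sketch in plain Python (not Cypher, and not how Neo4j stores or traverses data). It ranks friends-of-friends by the number of mutual friends. The names and the friendship graph are made up for illustration.

```python
from collections import Counter

# Hypothetical undirected friendship graph: person -> set of direct friends.
friends = {
    "asha": {"bala", "chen"},
    "bala": {"asha", "chen", "dev"},
    "chen": {"asha", "bala", "dev", "eli"},
    "dev": {"bala", "chen"},
    "eli": {"chen"},
}


def suggest_friends(graph, person):
    """Rank people two hops away by how many mutual friends they share."""
    mutual = Counter()
    for friend in graph[person]:
        for candidate in graph[friend]:
            # Skip the person themselves and anyone already connected.
            if candidate != person and candidate not in graph[person]:
                mutual[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))


suggestions = suggest_friends(friends, "asha")
print(suggestions)
```

A graph database keeps relationships as first-class data, so this kind of hop-by-hop traversal is what it is built to do at scale.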

Please check this video for more details about relationships and properties. It’s part of a series of videos which you can watch to understand more about this subject.