It’s been a long time since I last wrote a post. It’s been a really busy month, and yet it’s difficult to consciously stay away from writing. This is the final post in this series about NoSQL technologies.
My intent for this series is to cover a breadth of technologies to help readers understand the bigger picture. It feels like a revolution, with the technology growing at a massive scale. Everyone should have at least a basic understanding of technologies like cloud computing, NoSQL, big data and machine learning.
Okay, let’s talk about column family stores, often loosely referred to as columnar databases. The best way to explain the idea is with the example of columnstore indexes, introduced in SQL Server 2012. In traditional SQL Server tables, the data is stored in the form of rows:
Referring to the picture above: in RDBMS systems, data is stored on a page in units of complete rows. Even if we select a single column from the table, the entire row has to be brought into memory.
For business intelligence reports, we generally rely on aggregations like SUM, AVG, MAX and COUNT. Aggregating a single column on a row-based table with 1 billion rows means scanning the entire table to process the query. On the other hand, if the data is stored in the form of individual columns, the aggregation reads only the columns it needs, which greatly reduces the number of I/Os. Moreover, columnar databases offer high compression ratios, because the values within a single column tend to be similar; in favourable cases, 1 TB of data can shrink to a few GBs.
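To make the difference concrete, here is a minimal sketch in plain Python. It is only a toy model, not how SQL Server or any real engine lays out pages: the table, the column names (`id`, `site`, `hits`) and the helper functions are all made up for illustration. It counts how many values each layout has to read to sum one column, and uses simple run-length encoding as one example of why a sorted column compresses well.

```python
from itertools import groupby

# Toy table of (id, site, hits). All names and values are hypothetical.
rows = [
    (1, "a.example", 120),
    (2, "a.example", 100),
    (3, "b.example", 80),
    (4, "c.example", 60),
]

def sum_hits_row_store(rows):
    """Row layout: every record is read as a whole, even for one field."""
    total = 0
    values_read = 0
    for row in rows:
        values_read += len(row)  # the entire row is brought into memory
        total += row[2]          # but only the 'hits' field is needed
    return total, values_read

# Column layout: each column is stored contiguously on its own.
columns = {
    "id": [r[0] for r in rows],
    "site": [r[1] for r in rows],
    "hits": [r[2] for r in rows],
}

def sum_hits_column_store(columns):
    """Column layout: only the 'hits' column is touched."""
    hits = columns["hits"]
    return sum(hits), len(hits)

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

row_total, row_reads = sum_hits_row_store(rows)
col_total, col_reads = sum_hits_column_store(columns)
encoded_sites = run_length_encode(columns["site"])

print(row_total, row_reads)  # same total, 3 values read per row
print(col_total, col_reads)  # same total, 1 value read per row
print(encoded_sites)
```

Both layouts return the same sum, but the row layout reads three values per row while the column layout reads one. The gap grows with the number of columns in the table, which is why wide reporting tables benefit the most.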
This kind of database system is specifically designed to cater to aggregations and BI reporting. For example, if you want to find the average hits for each website on the web from a table that is terabytes in size, it’s probably going to take days with row-based storage. With a columnar database, the benefits are:
1. The data is picked up based on the columns selected in the query instead of the entire row. Mostly, these aggregations result in scans, and scanning TBs of data takes very long. By leveraging columnar databases, the I/O is drastically reduced.
2. Using columnar databases, e.g. HBase, we can leverage distributed processing to fetch the result faster. These NoSQL technologies scale out really well, so we can leverage the power of distributed queries across multiple machines. Data that takes days to process can be processed within minutes or seconds.
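The two benefits above can be sketched together for the "average hits per website" example. This is a rough illustration under stated assumptions, not the HBase API: each entry in `partitions` stands in for the slice of the table held by one machine, a thread pool stands in for the cluster, and the function names (`partial_aggregate`, `merge_partials`) are hypothetical. Each "machine" reads only the two columns it needs and returns a small (sum, count) summary per site, and the summaries are then merged into the final averages.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Each partition holds only the columns the query needs, stored column-wise.
# Data and names are made up for illustration.
partitions = [
    {"site": ["a.example", "b.example"], "hits": [120, 80]},
    {"site": ["a.example", "c.example"], "hits": [100, 60]},
]

def partial_aggregate(partition):
    """Runs on one 'machine': compute [sum, count] of hits per site."""
    acc = defaultdict(lambda: [0, 0])
    for site, hits in zip(partition["site"], partition["hits"]):
        acc[site][0] += hits
        acc[site][1] += 1
    return dict(acc)

def merge_partials(partials):
    """Combine per-machine [sum, count] pairs into a final average per site."""
    merged = defaultdict(lambda: [0, 0])
    for partial in partials:
        for site, (hits_sum, hits_count) in partial.items():
            merged[site][0] += hits_sum
            merged[site][1] += hits_count
    return {site: s / c for site, (s, c) in merged.items()}

# The thread pool is only a stand-in for work running on separate machines.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(partial_aggregate, partitions))

averages = merge_partials(partials)
print(averages)
```

Shipping sums and counts instead of averages is a deliberate choice: averages from different machines cannot be combined correctly without knowing how many rows each one covered.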
Major known players among columnar databases are HBase, Cassandra, Amazon’s DynamoDB and Google’s Bigtable.
References: https://www.youtube.com/watch?v=C3ilG2-tIn0