Archive for the ‘hadoop’ Category

De-Mystifying Hadoop – Part 2

This is in continuation with an earlier post linked here.

At the heart of Hadoop is the Hadoop Distributed File System (HDFS) and MapReduce programming model. These two components form the crux of the Hadoop eco-system.

Hadoop Distributed File System (HDFS)

Definition – HDFS is is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suitable for applications with large data sets.

HDFS is the storage system to store large amount of data typically in terabytes and petabytes. It comprises of interconnected nodes where files and directories reside. A single HDFS cluster will have one node known as the NameNode that manages the filesystem and keeps a check on other nodes which act as data nodes. The data is stored on the data nodes in blocks (default size is 64 Mb). Whenever data is written  to one of the nodes, data is replicated to other nodes in pipeline mode.

The data sent from the client follows the Client – Data Node 1 – Data Node 2 – Data Node pipeline path. However, the Client will be notified immediately once the data is written to Data Node 1 irrespective of whether the data replication has been complete or not.

HDFS was designed to run on the assumption, that at any point one of the nodes can go down. This assumption drove HDFS to the need for a design with quick fault-detection and recovery and data replication, to ensure no loss of data.

HDFS work on the Write-Once-Read-Many access model. The raw data initially transferred to HDFS remains as-is and all the processing that gets done on the data is stored in separate files enabling any time re-look up of the golden copy of data.


MapReduce Programming Model

The name MapReduce comes from the two functions – Map() and Reduce() from functional programming. Map is a procedure that performs filtering and sorting of data while Reduce procedure performs the final summary operations to generate the result. This is where the actual code gets written and MapReduce is what makes Hadoop powerful. You can go on writing n number of Map-Reduce programs on the same dataset to generate different views.  As the code is moved to data nodes, it ensures parallel execution without the developer needing to think about parallelism during development.

From a conceptual standpoint, there are three steps in MapReduce model

  1. Mapping

Data in the form of key-value pairs in passed in to the Map() method. Map method processes the key value pairr based on the type of data set and generates another  set of key-value pairs. This happens on individual data node. The output key-value pairs are placed on a shared space in HDFS.

  1. Shuffling

The output key-value pairs from the mapping functions from different nodes are now sorted in this phase. This is the only hase when data is shared across nodes. So from a performance standpoint, this becomes the key to optimize for performance. There is a concept of Combiners, which is a mini-Reduce program that gets executed on that particular data node itself. Similarly there is concept of Partitioning in which all the similar Keys are partitioned into similar buckets. Both these concepts help in improving the performance of the MapReduce program.

  1. Reduce

All the key-values after shuffling is fed into the Reduce function to get the necessary solution. The Reducers may or may not reside on all data nodes though it is recommended  to have them on every node. Importantly, users never marshal the information between the data nodes. This marshaling is  done within the Hadoop platform guided implicitly by the different keys associated with values.


PS: My objective of these posts is only to provide .the conceptual understanding of Hadoop as a whole and I am also in a learning phase. So do point out  any inaccuracies so that I can  improve my understanding. Believe, there will  be two more posts to cover the remaining topics as each topics are exhaustive in themselves.


De-Mystifying Hadoop – Part 1

There has been quite  a buzz on big data and hadoop. Frankly, self had also taken initiative and try to readup stuff on the internet. Somehow, I never was able to put my mind and  soul to learning the concepts. So then when a training came around, I jumped on to it.

Interestingly, after finishing the training lot of my colleagues came up to me asking about it as everyone had heard about it but nobody could actually elucidate. So my purpose is just to provide a simplistic view of hadoop, which can help anyone get started on it and can build their knowledge base from there.

I am not an expert here, but just sharing what I have understood. So here I am taking the plunge to demystify big data by putting in my two cents hoping it helps someone find their calling 🙂

With the maturing of internet, the amount of data getting generated is humungous. Businesses want  to tap this huge data source to improve their service offerings to their customers and make money from it. The challenge was the that the traditional databases were ill-equipped to handle huge amounts of data as they were never designed to handle such volumes of data in the first place. To add a bit more detail to the challenges –

  • Fundamentals have still not changed – Disk Seek is still the slowest performing component in any software architecture.

Any new design or principle had to work around this problem. Traditionally, there will be business logic which will retrieve data as and when needed. However, when amount of data increased, disk speed would come out as the biggest bottleneck. Oracle had introduced memcache to reduce the  issues.

Additionally there were ETL tools such as Informatica and Ab Initio. These tools focused on end of day processing and were very good at that. But these tools came up short when it came to online processing.  So there’s the challenge – Design something that can process huge  volumes of data and give near real-time information to businesses and customers.

  • Rise of Unstructured Data

With the number of podcast and videos being shared, storing and making the unstructured data  searchable became a necessity. There are more content creators in the world at present than it ever has been in the entire history of human beings. We needed applications that would help us find useful content from the cacophony of information. To put it simply, we need to filter information from the noise.

  • Resiliency and Availability

There were too many instances when the a blackout caused issues in recovering the data to the original state. Importantly, blackouts/downtime cost money. There are high availability systems which needs 99.999% availability and even if it fails, it should be back up in seconds if not in minutes. Many critical systems  do have availability in place, yet it normally comes at a high cost.


How does Hadoop does it?
Hadoop identifies the above three critical challenges and others that I might have missed to come up with a  solution which is easier to implement and which is open-source.  Now for the big paradigm shift that Hadoop brings which I really thinks makes all the difference.

All the earlier models focused on bringing the data to application server (code) to do the processing. Hadoop does it the other way around. It sends the code to all the data for processing. This enables parallel processing of the data without streaming any data.

The code size being very small and repeatable can be cached and sent to the servers hosting data. This relieves the  developers from writing code keeping parallel processing in mind.

Another key improvement is that the servers having data do not need to be high end servers. They can be small servers with processing and memory just a little bit more than what we get in standard laptop. To put it in numbers a 16 G memory machine with an I5 processor should be sufficient to act as one of the nodes.

From my perspective this was the basis from which everything else evolved. However, this is just the beginning and there is a huge new world that hadoop has created.

PS: I had planned to cover entire Hadoop in one post. However, the increasing length of the post has changed my mind. Hopefully I can finish it in the second part itself.

I have not covered history of Hadoop as I wanted to  focus on the concepts so that everyone can get started. However, I would really recommend reading the fascinating history of Haddop. Last but not the least I am thankful to Venkat Krishnan who imparted a very exhaustive but insightful training.

P.PS: Do post your feedback so that I can improve as needed.

%d bloggers like this: