De-Mystifying Hadoop – Part 1

There has been quite a buzz around big data and Hadoop. Frankly, I had also taken the initiative and tried to read up on it on the internet. Somehow, I was never able to put my mind and soul into learning the concepts. So when a training came around, I jumped on it.

Interestingly, after I finished the training, a lot of my colleagues came up to me asking about it; everyone had heard of Hadoop, but nobody could actually explain it. So my purpose here is simply to provide a simplified view of Hadoop that can help anyone get started and build their knowledge base from there.

I am not an expert here, just sharing what I have understood. So here I am, taking the plunge to demystify big data by putting in my two cents, hoping it helps someone find their calling 🙂

With the maturing of the internet, the amount of data being generated is humongous. Businesses want to tap this huge data source to improve their service offerings and make money from it. The challenge was that traditional databases were ill-equipped to handle huge amounts of data; they were never designed for such volumes in the first place. To add a bit more detail to the challenges –

  • Fundamentals have still not changed – disk seek is still the slowest-performing component in any software architecture.

Any new design or principle had to work around this problem. Traditionally, the business logic retrieves data as and when needed. However, as the amount of data increased, disk speed emerged as the biggest bottleneck. In-memory caches such as memcached were introduced to reduce these issues.

Additionally, there were ETL tools such as Informatica and Ab Initio. These tools focused on end-of-day batch processing and were very good at it. But they came up short when it came to online processing. So there's the challenge – design something that can process huge volumes of data and give near real-time information to businesses and customers.

  • Rise of Unstructured Data

With the number of podcasts and videos being shared, storing unstructured data and making it searchable became a necessity. There are more content creators in the world today than at any point in human history. We needed applications that would help us find useful content amid the cacophony of information. To put it simply, we need to filter information from the noise.

  • Resiliency and Availability

There were too many instances when a blackout caused issues in recovering data to its original state. Importantly, blackouts and downtime cost money. There are high-availability systems that need 99.999% availability, and even if one fails, it should be back up in seconds, if not minutes. Many critical systems do have such availability in place, yet it normally comes at a high cost.


How does Hadoop do it?
Hadoop addresses the above three critical challenges (and others that I might have missed) with a solution that is easier to implement and open-source. Now for the big paradigm shift that Hadoop brings, which I really think makes all the difference.

All the earlier models focused on bringing the data to the application server (the code) for processing. Hadoop does it the other way around: it sends the code to where the data lives. This enables parallel processing of the data without streaming any data across the network.

Since the code is very small and repeatable, it can be cached and sent to the servers hosting the data. This relieves developers from having to write code with parallel processing in mind.
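To make the "send code to the data" idea concrete, here is a minimal sketch in Python that simulates MapReduce-style word counting over data split across nodes. The data, function names, and node layout are all illustrative assumptions for this sketch, not real Hadoop APIs; in real Hadoop the framework ships the mapper to each node and runs the blocks in parallel.

```python
from collections import Counter

# Pretend each list is a block of data stored on a different node.
node_blocks = [
    ["big data", "hadoop moves code to data"],
    ["data is processed in parallel", "hadoop scales out"],
]

def map_phase(block):
    # The small piece of "code" shipped to each node:
    # count words locally, touching only that node's data.
    counts = Counter()
    for line in block:
        counts.update(line.split())
    return counts

def reduce_phase(partial_counts):
    # Merge the per-node partial results into the final answer.
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Each node runs the mapper on its own block; only the small
# partial results travel over the network, never the raw data.
partials = [map_phase(block) for block in node_blocks]
word_counts = reduce_phase(partials)
print(word_counts["data"])   # "data" appears in three lines, so this prints 3
```

The key point of the sketch is what moves: the few lines of mapper code go to each block of data, and only the compact partial counts come back to be merged.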

Another key improvement is that the servers holding the data do not need to be high-end. They can be small commodity servers with processing power and memory only a little more than a standard laptop. To put it in numbers, a machine with 16 GB of memory and an i5-class processor should be sufficient to act as one of the nodes.

From my perspective, this was the basis from which everything else evolved. However, this is just the beginning, and there is a huge new world that Hadoop has created.

PS: I had planned to cover all of Hadoop in one post. However, the increasing length of the post changed my mind. Hopefully I can finish it in the second part.

I have not covered the history of Hadoop, as I wanted to focus on the concepts so that everyone can get started. However, I would really recommend reading Hadoop's fascinating history. Last but not least, I am thankful to Venkat Krishnan, who imparted a very exhaustive yet insightful training.

PPS: Do post your feedback so that I can improve as needed.


3 comments so far

  1. Punit S on

    The concept of executing code where the data is was introduced with traditional databases as well – known as in-database processing. A very important differentiator with Hadoop is the redundancy that results in fault tolerance, which is vital for big data implementations.

    • Dishit D on

      Thanks man! I was not aware of this. I agree about the fault tolerance. Hadoop was designed with the assumption that nodes will go down, which ensured fault tolerance in the final design.

  2. […] This is in continuation with an earlier post linked here. […]
