
De-Mystifying Hadoop – Part 2

This post is a continuation of an earlier one, linked here.

At the heart of Hadoop are the Hadoop Distributed File System (HDFS) and the MapReduce programming model. These two components form the crux of the Hadoop ecosystem.

Hadoop Distributed File System (HDFS)

Definition – HDFS is a distributed, highly fault-tolerant file system designed to run on low-cost commodity hardware. HDFS provides high-throughput access to application data and is suited to applications with large data sets.

HDFS is the storage layer for large amounts of data, typically in terabytes and petabytes. It comprises interconnected nodes on which files and directories reside. A single HDFS cluster has one node known as the NameNode, which manages the filesystem and keeps a check on the other nodes, which act as DataNodes. The data is stored on the DataNodes in blocks (default size is 64 MB). Whenever data is written to one of the nodes, it is replicated to other nodes in pipeline fashion.

The data sent from the client follows the pipeline path Client – DataNode 1 – DataNode 2 – DataNode 3 (with the default replication factor of 3). However, the client is notified as soon as the data is written to DataNode 1, irrespective of whether replication to the remaining nodes has completed.
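To make this concrete, here is a minimal sketch of writing a file to HDFS through Hadoop's Java FileSystem API. The NameNode address and the file path below are placeholder assumptions; the block splitting and pipeline replication described above happen transparently beneath the output stream.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf);
             // HDFS splits the file into blocks and replicates each block
             // down the DataNode pipeline behind this stream.
             FSDataOutputStream out = fs.create(new Path("/data/sample.txt"), true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```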

HDFS was designed on the assumption that, at any point in time, any of the nodes can go down. This assumption drove HDFS towards a design with quick fault detection and recovery, and with data replication, to ensure no loss of data.

HDFS works on the Write-Once-Read-Many access model. The raw data initially transferred to HDFS remains as-is, and the output of any processing done on that data is stored in separate files, so the golden copy of the data can be revisited at any time.


MapReduce Programming Model

The name MapReduce comes from two functions in functional programming – Map() and Reduce(). Map is a procedure that performs filtering and sorting of data, while the Reduce procedure performs the final summary operations to generate the result. This is where the actual code gets written, and MapReduce is what makes Hadoop powerful: you can keep writing any number of MapReduce programs over the same dataset to generate different views. Because the code is moved to the data nodes, execution is parallel by design, without the developer needing to think about parallelism during development.
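The functional-programming lineage is easy to see even in plain Java. The following sketch uses the Java streams API purely as an analogy – it involves no Hadoop code – to show a map step transforming elements and a reduce step folding them into a summary:

```java
import java.util.List;

public class MapReduceAnalogy {
    public static void main(String[] args) {
        List<String> words = List.of("hadoop", "hdfs", "mapreduce", "hadoop");

        int totalChars = words.stream()
                .map(String::length)      // "map": transform each word into a value
                .reduce(0, Integer::sum); // "reduce": fold the values into one result

        System.out.println(totalChars);   // prints 25
    }
}
```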

From a conceptual standpoint, there are three steps in the MapReduce model:

  1. Mapping

Data in the form of key-value pairs is passed in to the Map() method. The Map method processes each key-value pair based on the type of data set and generates another set of key-value pairs. This happens on each data node individually. The output key-value pairs are then staged for the next phase.
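As a concrete illustration, here is the mapper from the canonical word-count example, written against the org.apache.hadoop.mapreduce API (my own minimal version, not taken from any particular distribution):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each input line (key = byte offset in the file, value = the line text),
// emit one (word, 1) pair per word.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }
}
```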

  2. Shuffling

The output key-value pairs from the mapping functions on the different nodes are now sorted and grouped by key. This is the only phase in which data is shared across nodes, so from a performance standpoint it is the key place to optimize. There is a concept of Combiners – a mini-Reduce program that executes on the particular data node itself – and similarly a concept of Partitioning, in which all similar keys are routed to the same bucket. Both of these concepts help improve the performance of a MapReduce program.
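To sketch the partitioning idea, below is a toy custom partitioner that buckets words by their first letter, so all keys starting with the same letter reach the same reducer. Hadoop's default is HashPartitioner; this class is purely illustrative:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Toy partitioner: route keys to reducer "buckets" by their first letter,
// so similar keys land in the same partition.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
            return 0;
        }
        int firstLetter = Character.toLowerCase(key.toString().charAt(0));
        return (firstLetter & Integer.MAX_VALUE) % numPartitions;
    }
}
```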

  3. Reduce

All the key-value pairs, after shuffling, are fed into the Reduce function to produce the required result. The Reducers may or may not reside on all data nodes, though it is recommended to have them on every node. Importantly, users never marshal information between the data nodes themselves; this marshaling is done within the Hadoop platform, guided implicitly by the keys associated with the values.
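Continuing the word-count sketch, the reducer receives each word together with all of its 1s (grouped by the shuffle) and sums them into a final count:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// All values for the same key arrive together after the shuffle;
// summing them yields the final count for that word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```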
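Finally, a minimal driver wires the pieces together. Note how the combiner is simply the reducer class re-used as a local mini-reduce on map output, and the toy partitioner from above is plugged in; the input and output paths come from the command line:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);      // mini-Reduce on each node
        job.setPartitionerClass(FirstLetterPartitioner.class); // toy bucketing, see above
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```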


PS: My objective with these posts is only to provide a conceptual understanding of Hadoop as a whole, and I am also in a learning phase. So do point out any inaccuracies so that I can improve my understanding. I believe there will be two more posts to cover the remaining topics, as each topic is exhaustive in itself.
