
Hadoop – An Introduction


 

 

Companies are generating (or sourcing) more data than they can crunch with traditional BI tools. This is not to say that the traditional BI tools are not capable, but sometimes it just doesn’t make sense to pay for licenses, hardware, power and other infrastructure for jobs that run once a week (or sometimes once a month). So what is Hadoop? It is a combination of the following three things –

1.       HDFS – A distributed file system which splits each file into large blocks (64 MB by default, configurable) and stores every block on several data nodes (3 replicas by default). Why? So that if one node fails, its blocks can be re-created from the copies on other nodes, and so that all the blocks can be read in parallel.

2.       MapReduce – A programming paradigm which structures a job in two ways: the data is processed in parallel, and the problem is broken down into two steps (Map and Reduce). Map tasks read the data in parallel on all the nodes and generate an intermediate output. Reduce tasks read this intermediate data (again in parallel) and produce the final output. It suits very specific use cases: log file analysis, web search indexing, anything with lots of text to be analyzed.

3.       Hadoop – It’s the orchestration layer which does the following –

a.       Monitors all the nodes

b.      Monitors the map and reduce processes, e.g. the reduce step cannot start until the map tasks have finished. If a map task fails, it reassigns the work to another node.

c.       Restores data in case one node fails
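
To make the Map and Reduce steps concrete, here is a minimal single-machine sketch of a word count in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the "chunks" stand in for the blocks that HDFS would hand to separate map tasks.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Group the intermediate pairs by key so each word reaches one reducer."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)

def word_count(chunks):
    # On a cluster each chunk would be handled by its own map task in parallel;
    # here the map tasks simply run one after another.
    mapped = [map_phase(chunk) for chunk in chunks]
    # The reduce step needs the complete intermediate output of the map step.
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["the cat sat", "The dog sat"]))
```

The point of the split is that map_phase never looks beyond its own chunk and reduce_phase never looks beyond its own key, which is what lets a framework run many copies of each in parallel.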

 

What are the business benefits –

1.       The hardware infrastructure can be standard off-the-shelf – commodity servers and standard network switches with a flat network topology

2.       No complex clustering software

3.       No costly BI software

4.       The application software becomes significantly simpler because the code does not depend on the hardware topology. High availability, which is costly to build into each application, comes from the platform instead

5.       If run on Amazon Elastic MapReduce (EMR), one can start with zero capex and use capacity purely on demand.

6.       All open source

 

 

 

Technical advantages vis-à-vis other solutions (paid) –

1.       Built-in high availability – self-healing – at several levels –

a.       Partial failure – The tasks running on failed nodes are picked up and executed by the other available nodes

b.      Data recovery – The data lost from failed nodes is recovered using the data on available nodes

c.       Individual recoverability – The nodes being added to the cluster are able to join the group –

    i.      As an active working node

    ii.      Without any need to restart the group

d.      Consistency – Tasks failing and recovering on other nodes do not cause –

    i.      Data corruption

    ii.      System-wide hang or non-determinism

e.      Scalability

    i.      Increased load should lead to a graceful decline in performance, not outright failure

    ii.      Increased resources should lead to a proportional increase in capacity

2.       General purpose

3.       Low entry barrier – no schema related complexities, just load the data and ready to go

4.       Mostly for processing lots of data – Petabytes

5.       Designed for failure – commodity hardware will fail, so the framework code is written to provide HA in software (the more nodes in the cluster, the shorter the cluster-wide MTBF)

6.       Computations ‘come’ to data rather than the traditional way of bringing the data to the compute

7.       Scalability is horizontal – add compute nodes without worrying about the software. 
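
Why "designed for failure" matters can be shown with a rough MTBF calculation. The sketch below assumes that nodes fail independently and at a constant rate, in which case the expected time between failures anywhere in the cluster is the per-node MTBF divided by the number of nodes. The 3-year node MTBF and the 1,000-node cluster are illustrative figures, not measurements.

```python
HOURS_PER_YEAR = 365 * 24

def cluster_mtbf_hours(node_mtbf_hours, node_count):
    """Expected hours between failures somewhere in the cluster.

    Assumes independent nodes with a constant failure rate, so the
    individual failure rates (1 / MTBF) simply add up.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return node_mtbf_hours / node_count

def expected_failures_per_day(node_mtbf_hours, node_count):
    """Average number of node failures per day under the same assumptions."""
    return 24 / cluster_mtbf_hours(node_mtbf_hours, node_count)

if __name__ == "__main__":
    node_mtbf = 3 * HOURS_PER_YEAR  # illustrative: one failure per node every 3 years
    print(cluster_mtbf_hours(node_mtbf, 1000))
    print(expected_failures_per_day(node_mtbf, 1000))
```

Under these assumptions a 1,000-node cluster sees a node failure roughly once a day, which is why recovery has to be automatic rather than an operator task.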

 

What is abstracted –

1.       From the programmer’s perspective

a.       Low-level network programming – socket() calls and the like – Hadoop handles this

b.      Failover and load balancing – Hadoop does this: it monitors the tasks and, if any of them fail, starts them again on other nodes

c.       Data spread across compute nodes – HDFS does this
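
As a rough picture of what HDFS takes off the programmer's hands, the sketch below splits a file into fixed-size blocks and assigns each block to three different nodes. It is a simplification under stated assumptions: the 64 MB block size and replication factor of 3 are the classic defaults, the round-robin placement is a stand-in for illustration (real HDFS placement also considers racks), and all function and node names are hypothetical.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB   # assumed default block size
REPLICATION = 3        # assumed default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last block may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def survivors(placement, failed_node):
    """Replica holders that remain for each block after one node fails."""
    return {
        block: [node for node in holders if node != failed_node]
        for block, holders in placement.items()
    }

if __name__ == "__main__":
    blocks = split_into_blocks(200 * MB)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
    print(placement)
    print(survivors(placement, "n1"))
```

Because every block lives on three nodes, losing any single node still leaves at least two copies of each block from which the missing replica can be rebuilt.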

Features of Hadoop –

1.       There is only one Job Tracker, and it runs on the master node

2.       One Task Tracker per slave node

3.       Multiple task instances per node, tracked by the Task Tracker – each runs inside its own JVM and communicates with the Task Tracker over RPC

4.       So even if one task instance crashes or hangs, the node stays available for the other tasks

5.       Typically one task instance per core
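
The division of labour between the Job Tracker and the Task Trackers can be sketched as a toy scheduler. This is not Hadoop code: the functions schedule and reschedule_after_failure, the node names and the slot counts are all hypothetical, and "slots" here simply means one task instance per core as described above.

```python
def schedule(tasks, slots_per_node):
    """Toy job tracker: hand out tasks in waves, one task per free slot.

    slots_per_node maps a node name to its number of task slots (cores).
    Returns a list of waves, each mapping task -> node.
    """
    if sum(slots_per_node.values()) < 1:
        raise ValueError("no task slots available")
    pending = list(tasks)
    waves = []
    while pending:
        wave = {}
        for node, slots in slots_per_node.items():
            for _ in range(slots):
                if not pending:
                    break
                wave[pending.pop(0)] = node
        waves.append(wave)
    return waves

def reschedule_after_failure(wave, failed_node, slots_per_node):
    """Re-run the tasks that were on a failed node using the healthy nodes."""
    healthy = {n: s for n, s in slots_per_node.items() if n != failed_node}
    lost_tasks = [task for task, node in wave.items() if node == failed_node]
    return schedule(lost_tasks, healthy) if lost_tasks else []

if __name__ == "__main__":
    slots = {"node-a": 2, "node-b": 2}
    waves = schedule(["m0", "m1", "m2", "m3", "m4"], slots)
    print(waves)
    print(reschedule_after_failure(waves[0], "node-a", slots))
```

Only the tasks that were running on the failed node are repeated; work already assigned to healthy nodes is left alone.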