Companies are generating (or sourcing) more data than they could ever crunch with the traditional BI tools. Not saying that traditional BI guys are not capable, but sometimes it just doesnt make sense to be paying for licenses, hardware, power and other infrastructure for jobs which run once in a week (or sometimes month). So what is hadoop? It is a combination of following three things
1. HDFS A distributed file system which splits the individual files into multiple 64k blocks and spreads the chunks into at least 3 different data nodes. Why? So that if one node fails it can be regenerated from other nodes. AND allows parallel access to all these chunks.
2. MapReduce A new programming paradigm, which divides the jobs two-ways Parallel processing of data + two-step break down of the problem (Map & Reduce). Map tasks access the data parallel on all the nodes and generate an intermediate output. Reduce tasks access this intermediate data (again in parallel) and produce the final output. Very specific use cases log file analysis, web search index, anything with lots of text to be analyzed
3. Hadoop Its the orchestration layer which does the following
a. Monitors all the nodes
b. Monitors the processes of map and reduce for ex. reduce cannot start till map tasks have ended. If map process fails, it allocates the jobs to other map processes etc.
c. Restores data in case one node fails
What are the business benefits
1. The hardware infrastructure can be standard off-the-shelf commodity servers and standard networking switching with flat network topology
2. No complex clustering software
3. No costly BI software
4. The complexity of the software reduces significantly because the code is not dependent on the hardware topology. It is costly to build high availability in each application
5. If done on Amazon MapReduce, one can start with zero capex and use capacity purely on-demand.
6. All open source
Technical advantages
vis-ΰ-vis other solutions (paid)
1. In-built high-availability self-healing 3 levels
a. Partial failure The tasks running on failed nodes are picked up and executed by the other available nodes
b. Data recovery The data lost from failed nodes is recovered using the data on available nodes
c. Individual recoverability The nodes being added to the cluster are able to join the group
i. As an active working node
ii. Without any need to restart the group
d. Consistency Tasks failing and recovering on other nodes do not cause
i. Data corruption
ii. System-wide hang or non-determinism
e. Scalability
i. Increased load should lead to graceful decline of performance and not outright failure
ii. Increased resources should lead to proportional increase in capacity
2. General purpose
3. Low entry barrier no schema related complexities, just load the data and ready to go
4. Mostly for processing lots of data Petabytes
5. Designed for failures commodity hardware will fail and code written ensures HA (how to calculate MTBF)
6. Computations come to data rather than the traditional way of bringing the data to the compute
7. Scalability is horizontal add compute nodes without worrying about the software.
What is abstracted
1. From the programmers perspective
a. Underlying networking coding socket () etc etc Hadoop
b. Failover and load balancing Hadoop does this monitors the tasks and if these tasks fail Hadoop starts those tasks on other nodes
c. Data spread across compute nodes HDFS does this
Features of Hadoop
1. There is only one Job Tracker in the master node
2. One Task Tracker per slave node
3. Multiple task instances per node being tracked by the Task Tracker running inside the individual JVMs talk to each other over RPC
4. So even if one task instance crashes or hangs due to some issue, that particular node doesnt go unavailable for the other tasks
5. One task instance per core