Friday, December 27, 2013

What is Hadoop - Introduction



What is Hadoop?

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both MapReduce and the distributed file system are designed so that node failures are handled automatically by the framework.
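To make the idea of "fragments of work that may be executed or re-executed on any node" more concrete, here is a toy, single-process sketch in Python. It is not Hadoop code and does not use any Hadoop API. The function names (split_into_fragments, run_fragment, run_job), the node names, and the failed_nodes set are all made up for illustration, and the "work" done on each fragment is just a sum.

```python
# Toy illustration only: simulates fragments of work being retried on
# another node when the first node is down. Nothing here is a Hadoop API.

def split_into_fragments(records, fragment_size):
    """Divide the input into small, independent fragments of work."""
    return [records[i:i + fragment_size]
            for i in range(0, len(records), fragment_size)]


def run_fragment(fragment, node, failed_nodes):
    """Pretend to run one fragment on one node; fail if the node is down."""
    if node in failed_nodes:
        raise RuntimeError(f"node {node} is down")
    return sum(fragment)


def run_job(records, nodes, failed_nodes, fragment_size=2):
    """Run every fragment, re-executing it on the next node after a failure."""
    partial_results = []
    fragments = split_into_fragments(records, fragment_size)
    for index, fragment in enumerate(fragments):
        for offset in range(len(nodes)):
            # Start at a preferred node, then move on to the others.
            node = nodes[(index + offset) % len(nodes)]
            try:
                partial_results.append(run_fragment(fragment, node, failed_nodes))
                break
            except RuntimeError:
                continue  # re-execute the same fragment on another node
        else:
            raise RuntimeError("no healthy node available")
    return sum(partial_results)


if __name__ == "__main__":
    # Node "b" is down, so its fragment is re-executed on node "c".
    print(run_job([1, 2, 3, 4, 5, 6], ["a", "b", "c"], failed_nodes={"b"}))
```

The caller never sees the failure of node "b"; the job still returns the full result because each fragment is independent and can simply be run again elsewhere. In a real cluster the framework does this bookkeeping for you.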



What is MapReduce?
Simplified Data Processing on Large Clusters
- MapReduce is a programming model and supporting infrastructure for processing and generating large data sets. A map function processes the input data to generate intermediate key/value pairs, and a reduce function then merges all values associated with the same key. Programs are automatically parallelized and executed on a run-time system that partitions the input data, schedules execution across machines, handles machine failures, and manages communication between machines.

- This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system!
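
The map and reduce steps described above can be sketched with the classic word-count example. This is a minimal, in-memory illustration in plain Python, not the Hadoop API: the function names (map_words, shuffle, reduce_counts, word_count) are assumptions chosen for this post, and the grouping step that a real run-time system performs across machines is simulated here with a dictionary.

```python
# Minimal in-memory sketch of the map -> group by key -> reduce flow.
# Not Hadoop code; every name below is made up for illustration.
from collections import defaultdict


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values that share the same key (done by the framework)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: merge all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the whole pipeline over a list of input strings."""
    pairs = (pair for document in documents for pair in map_words(document))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["the cat sat", "the dog sat"]))
```

The programmer only writes the map and reduce functions. Because each call to map_words and reduce_counts depends only on its own input, a real framework is free to run those calls in parallel on different machines.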