This section gives a quick overview of the Hadoop architecture. Hadoop consists of two main parts: HDFS (Hadoop Distributed File System) and MapReduce. The figure below shows these two parts, their components, and how they communicate with each other.
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004). It is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, computation runs where the data is stored rather than moving the data to the computation; data storage and computation coexist on the same physical nodes in the cluster. By taking advantage of this data proximity, MapReduce can process very large amounts of data while largely avoiding traditional bottlenecks such as network bandwidth.
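The idea of running computation next to the data can be illustrated with a small scheduling sketch. This is not Hadoop's actual scheduler; the function `assign_task`, the block IDs, and the node names are all hypothetical and chosen only for illustration. It assumes we know which nodes hold a copy of each block and which nodes are currently free.

```python
def assign_task(block_id, block_locations, free_nodes):
    """Pick a node for the task that processes block_id (hypothetical helper).

    Prefer a free node that already stores the block (data-local);
    otherwise fall back to any free node, which implies a network copy.
    """
    for node in block_locations.get(block_id, []):
        if node in free_nodes:
            return node, True  # data-local: no block transfer needed
    if not free_nodes:
        raise RuntimeError("no free nodes available")
    # No free node holds the block, so the block must be fetched remotely.
    return sorted(free_nodes)[0], False


if __name__ == "__main__":
    # Hypothetical cluster state: which nodes store which blocks.
    locations = {"blk_1": ["node-a", "node-b"], "blk_2": ["node-c"]}
    free = {"node-b", "node-d"}
    print(assign_task("blk_1", locations, free))
    print(assign_task("blk_2", locations, free))
```

In this sketch the task for `blk_1` lands on a node that already stores the block, while the task for `blk_2` has to run remotely because its only holder is busy. The first case is the one the framework tries to achieve as often as possible.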
The primary objective of MapReduce is to split the input data set into independent chunks that are processed in a completely parallel manner. Two functions, map() and reduce(), perform the work. The Hadoop MapReduce framework shuffles and sorts the outputs of the map tasks, which then become the input to the reduce tasks that generate the final output. Typically, both the input and the output of the job are stored in a file system.
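The map, shuffle/sort, and reduce flow described above can be sketched as a single-process word count. This is a minimal illustration in plain Python, not the Hadoop API: the names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are assumptions made for this example, and the chunks are processed sequentially, whereas a real cluster would process them in parallel.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle/sort step: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    # Each chunk is independent, so a real framework could map them in parallel.
    mapped = [pair for chunk in chunks for line in chunk for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped))


if __name__ == "__main__":
    # Two hypothetical input chunks, each a list of text lines.
    chunks = [["the quick brown fox"], ["the lazy dog", "the end"]]
    print(run_job(chunks))
```

Because each call to `map_fn` depends only on its own line, the chunks can be processed on different machines. Only the shuffle step needs to bring together values that share a key.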
HDFS (Hadoop Distributed File System)
- HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to hold very large amounts of data.
- HDFS manages the data using three types of servers: Name Node, Secondary Name Node, and Data Node.
- Files are stored redundantly across multiple machines to ensure durability against failure and high availability to highly parallel applications.
- HDFS should store data reliably. If individual machines in the cluster malfunction, data should still be available.
- HDFS should provide fast, scalable access to this information. It should be possible
to serve a larger number of clients by simply adding more machines to the