This section provides a quick overview of the Hadoop architecture. Hadoop consists of two main parts: HDFS (Hadoop Distributed File System) and MapReduce. The figure below shows the architecture of Hadoop with these two parts, their components, and how they communicate with each other.
MapReduce Part
MapReduce is a programming model and software framework first developed by Google (described in Google's MapReduce paper, published in 2004). It is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
MapReduce is a massively scalable, parallel processing framework that works in tandem with HDFS. With MapReduce and Hadoop, computation is executed at the location of the data rather than moving the data to the computation; data storage and computation coexist on the same physical nodes in the cluster. By taking advantage of this data proximity, MapReduce can process exceedingly large amounts of data without being limited by traditional bottlenecks such as network bandwidth.
The primary objective of MapReduce is to split the input data set into independent chunks that are processed in a completely parallel manner. Two methods, map() and reduce(), perform the work: the map tasks turn the input chunks into intermediate key/value pairs, the Hadoop MapReduce framework sorts and shuffles the outputs of the maps, and those sorted outputs become the input to the reduce tasks, which generate the final output. Typically, both the input and the output of the job are stored in a file system.
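To make the map()/reduce() flow concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API (org.apache.hadoop.mapreduce). The input and output paths are assumed to be passed as command-line arguments; everything else follows the standard word-count pattern rather than any specific production setup.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): emits (word, 1) for every word in an input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // reduce(): after the shuffle/sort, sums the counts for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // job input in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // job output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Such a job would typically be packaged as a JAR and submitted with something like `hadoop jar wordcount.jar WordCount /input /output`, where /input and /output are hypothetical HDFS paths.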
HDFS (Hadoop Distributed File System)
- HDFS stands for Hadoop Distributed File System. It is a distributed file system designed to hold very large amounts of data.
- HDFS manages the data using three types of servers: the NameNode, the Secondary NameNode, and the DataNodes.
- Files are stored in a redundant fashion across multiple machines to ensure their durability against failure and high availability to highly parallel applications.
- HDFS should store data reliably: if individual machines in the cluster malfunction, the data should still be available.
- HDFS should provide fast, scalable access to this information. It should be possible to serve a larger number of clients by simply adding more machines to the cluster (a minimal client sketch follows this list).
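As a rough illustration of how a client talks to HDFS, the sketch below uses Hadoop's Java FileSystem API to write a small file and read it back. The path /tmp/hello.txt is a hypothetical example, and the NameNode address is assumed to come from fs.defaultFS in the cluster's core-site.xml; this is a sketch of typical client usage, not part of the architecture described above.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; HDFS replicates its blocks across DataNodes
    Path file = new Path("/tmp/hello.txt"); // hypothetical path for the example
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeBytes("Hello HDFS\n");
    }

    // Read it back; the client asks the NameNode for block locations,
    // then streams the data directly from the DataNodes
    try (BufferedReader in =
             new BufferedReader(new InputStreamReader(fs.open(file)))) {
      System.out.println(in.readLine());
    }
  }
}
```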