Wednesday 22 July 2015

What is HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project.
HDFS follows a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, HDFS is highly fault tolerant and built to run on low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge data, files are split into blocks and stored across multiple machines. These blocks are stored redundantly so the system can recover from data loss when a machine fails. HDFS also makes the data available to applications for parallel processing.
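The idea of splitting a file into blocks and placing each block on several machines can be sketched in a few lines of Python. This is an illustrative toy model, not real HDFS code; the block size, node names, and function names here are made up for the example (HDFS actually defaults to 128 MB blocks and a replication factor of 3).

```python
# Toy sketch of HDFS-style block splitting and replica placement.
# Not real HDFS code: block size and node names are illustrative.
BLOCK_SIZE = 4          # toy block size; real HDFS defaults to 128 MB
REPLICATION = 3         # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [datanodes[(idx + r) % len(datanodes)]
                          for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello hdfs world!")
nodes = ["dn1", "dn2", "dn3", "dn4"]
print(place_blocks(blocks, nodes))
```

Because each block lives on several nodes, losing any single machine never loses data: a surviving replica is re-copied to bring the count back up.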
According to The Apache Software Foundation, the primary objective of HDFS is to store data reliably even in the presence of failures, including NameNode failures, DataNode failures, and network partitions.
The NameNode is a single point of failure for the HDFS cluster, while the DataNodes store the actual data. This Apache Software Foundation project is designed to provide a fault-tolerant file system that runs on commodity hardware.

HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files, along with DataNodes that store the data itself.
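The master/slave split above can be modelled in a short sketch: the NameNode holds only metadata (which blocks make up a file and where they live), while DataNodes hold the block contents. This is a toy model for illustration, not the real Hadoop API; all class and method names here are hypothetical.

```python
# Toy model of the NameNode/DataNode split (not the real Hadoop API).
class DataNode:
    """Slave: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}                    # block_id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    """Master: stores only namespace metadata, never file data."""
    def __init__(self):
        self.namespace = {}                 # path -> [(block_id, [DataNode])]

    def create_file(self, path, block_locations):
        self.namespace[path] = block_locations

    def lookup(self, path):
        """Tell a client which blocks to fetch, and from where."""
        return self.namespace[path]

dn = DataNode("dn1")
dn.store("blk_0", b"hello")
nn = NameNode()
nn.create_file("/user/demo.txt", [("blk_0", [dn])])
block_id, nodes = nn.lookup("/user/demo.txt")[0]
print(nodes[0].blocks[block_id])
```

Note the design choice this mirrors: clients ask the NameNode only for locations, then read the bytes directly from DataNodes, so the master never becomes a data-transfer bottleneck.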

Monday 20 July 2015

Why Hadoop ?


 
Apache™ Hadoop® enables big data applications for both operations and analytics, and is one of the fastest-growing technologies providing competitive advantage for businesses across the world.
Hadoop is set to be a key component of next-generation data architectures.
Hadoop is an open-source framework that allows you to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
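The "simple programming models" Hadoop is known for are the map and reduce phases of MapReduce. The classic example is word count, sketched here in plain local Python to show the shape of the model; a real Hadoop job runs the same two-phase logic in parallel across the cluster's machines (this is an illustration, not Hadoop's actual Java API).

```python
# Toy word count mirroring MapReduce's phases (illustration only).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Shuffle/sort by key, then sum the counts for each word."""
    pairs = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big clusters", "data everywhere"]
print(dict(reduce_phase(map_phase(lines))))
# {'big': 2, 'clusters': 1, 'data': 2, 'everywhere': 1}
```

The appeal is that the programmer writes only these two small functions; the framework handles splitting the input, scheduling, moving data between phases, and retrying failed tasks.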
 
Hadoop provides a massively scalable distributed storage and processing platform, enabling businesses to build new data-driven applications while freeing up resources from existing systems. MapR is one production-ready distribution of Apache Hadoop.