Thursday, 20 August 2015

What is the driver class in Hadoop?

The driver program does not perform any of the computation itself. It is mainly used to:

1. Set the Hadoop Configuration.
2. Create and set up the Job.
3. Access the FileSystem, if the driver needs to inspect or prepare paths.
4. Specify the input file path and the output file path.
5. Identify the jar that contains the main class (for example, with setJarByClass), because Hadoop distributes the job code to the cluster as a jar file.
6. Specify the map output key class and the map output value class.
7. Specify the reduce output key class and the reduce output value class.
8. Specify the Mapper-derived class and the Reducer-derived class.
9. Wait until the job is completed.
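
The steps above can be sketched as a single driver class. This is only a minimal sketch, assuming the Hadoop 2 org.apache.hadoop.mapreduce API is on the classpath. The class name SketchDriver and the helper buildJob are made up for illustration. The base Mapper and Reducer classes, which pass records through unchanged, stand in for your own derived classes. Deleting an existing output directory is just one guess at why a driver would need the FileSystem.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: it only configures and submits the job, it computes nothing.
public class SketchDriver {

    // Hypothetical helper that performs steps 2 to 8 and returns the configured job.
    static Job buildJob(Configuration conf, Path input, Path output) throws IOException {
        Job job = Job.getInstance(conf, "driver sketch");   // step 2: create the job
        job.setJarByClass(SketchDriver.class);              // step 5: locate the jar to ship

        // Step 3 (assumption): use the FileSystem to remove a stale output
        // directory, since a job fails if its output path already exists.
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        FileInputFormat.addInputPath(job, input);           // step 4: input path
        FileOutputFormat.setOutputPath(job, output);        // step 4: output path

        // Step 8: placeholders. The base classes act as identity mapper/reducer;
        // replace them with your own Mapper- and Reducer-derived classes.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Steps 6 and 7: with the default text input format and identity classes,
        // keys are byte offsets (LongWritable) and values are lines (Text).
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        return job;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: SketchDriver <input path> <output path>");
            System.exit(2);
        }
        Configuration conf = new Configuration();           // step 1
        Job job = buildJob(conf, new Path(args[0]), new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);   // step 9: block until done
    }
}
```

Once packaged into a jar, a driver like this would typically be launched with something like: hadoop jar <your-jar> SketchDriver <input path> <output path>.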

Data types in Hadoop

Hadoop provides its own Writable wrapper for each of the common Java primitive types, such as:

int --------> IntWritable
long --------> LongWritable
boolean --------> BooleanWritable
float --------> FloatWritable
byte --------> ByteWritable

The following built-in data types can be used as both keys and values:

Text: This stores UTF-8 text.
BytesWritable: This stores a sequence of bytes.
VIntWritable and VLongWritable: These store variable-length integer and long values.
NullWritable: This is a zero-length Writable type that can be used when you don’t want to use a key or value type.
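
To give a feel for what "variable length" means, here is a small sketch in plain Java. It uses a generic scheme of 7 data bits per byte, chosen for brevity. It is not Hadoop's own wire format for VIntWritable and VLongWritable, which differs in detail, and the class and method names are made up. The idea is the same though: small values take fewer bytes than the fixed 8 bytes of a long.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration of variable-length integer encoding (not Hadoop's format).
public class VarLenSketch {

    // Writes 7 bits of the value per byte; the high bit marks "more bytes follow".
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Reassembles the value from the 7-bit groups, lowest group first.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] samples = {5L, 300L, 1L << 40};
        for (long v : samples) {
            byte[] encoded = encode(v);
            System.out.println(v + " -> " + encoded.length + " byte(s), decoded back to " + decode(encoded));
        }
    }
}
```

In this scheme the value 5 fits in one byte and 300 takes two, so a data set dominated by small numbers shrinks noticeably compared with a fixed-width encoding.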

The following Hadoop built-in collection data types can only be used as value types:

ArrayWritable: This stores an array of values belonging to a Writable type.

Note: You may wonder why Hadoop wraps every simple data type in a Writable. In a big data world, structured objects need to be serialized to a byte stream to move over the network or to persist to disk on the cluster, and then deserialized back again as needed. When you have vast amounts of data to store and move, at Facebook scale for example, the serialized form needs to be efficient: it should take as little space to store and as little time to move as possible.
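
The serialization round trip described in the note can be sketched without any Hadoop dependency. Hadoop's Writable interface has two methods, write(DataOutput) and readFields(DataInput). The SimpleWritable interface and IntBox class below are hypothetical stand-ins that mirror that contract using only the JDK, so this is an illustration of the idea rather than Hadoop's own code. The sketch also compares the result with standard Java serialization of an Integer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

public class WritableSketch {

    // Hypothetical stand-in with the same two methods as Hadoop's Writable.
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical int wrapper, playing the role IntWritable plays in Hadoop.
    static class IntBox implements SimpleWritable {
        int value;
        IntBox() { }
        IntBox(int value) { this.value = value; }
        public void write(DataOutput out) throws IOException { out.writeInt(value); }
        public void readFields(DataInput in) throws IOException { value = in.readInt(); }
    }

    // Serializes with the Writable-style contract: only the raw field bytes are written.
    static byte[] toBytes(SimpleWritable w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Deserializes by filling a fresh object from the byte stream.
    static IntBox fromBytes(byte[] bytes) {
        IntBox box = new IntBox();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            box.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return box;
    }

    // Serializes with standard Java serialization, which also writes class metadata.
    static byte[] javaSerialize(Object o) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compact = toBytes(new IntBox(42));
        System.out.println("Writable-style size: " + compact.length + " bytes");
        System.out.println("Java serialization size: " + javaSerialize(Integer.valueOf(42)).length + " bytes");
        System.out.println("Round trip value: " + fromBytes(compact).value);
    }
}
```

The Writable-style form holds just the four bytes of the int, while Java serialization adds class metadata on top of the value. Multiplied across billions of records, that overhead is what the Writable types are designed to avoid.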