Wednesday, December 22, 2010

How does an HDFS input file get mapped to a MapReduce Mapper's input?

Though quite late to the party, I started on Hadoop around a week back; it took me a couple of days (with the help of my team members) to set up a local Hadoop installation on my system using Cygwin.
I wrote an example MapReduce job in which the Mapper processes a given file to calculate the GPS displacement for a person based on latitude and longitude information, and the Reducer then figures out the maximum displacement from the combined displacement list.
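For reference, here is a minimal sketch of roughly what that job looks like in the new (org.apache.hadoop.mapreduce) API. The class names, the assumed tab-separated input layout, and the use of the haversine formula are my reconstruction for illustration, not the actual code:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GpsDisplacement {

        // Assumed input line: personId<TAB>lat1,lon1<TAB>lat2,lon2
        public static class DisplacementMapper
                extends Mapper<LongWritable, Text, Text, DoubleWritable> {

            private static final double EARTH_RADIUS_KM = 6371.0;

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] fields = line.toString().split("\t");
                if (fields.length < 3) {
                    return;                      // skip malformed lines
                }
                String[] from = fields[1].split(",");
                String[] to   = fields[2].split(",");
                double km = haversine(
                        Double.parseDouble(from[0]), Double.parseDouble(from[1]),
                        Double.parseDouble(to[0]),   Double.parseDouble(to[1]));
                context.write(new Text(fields[0]), new DoubleWritable(km));
            }

            // Great-circle (haversine) distance between two lat/lon points, in km.
            private static double haversine(double lat1, double lon1,
                                            double lat2, double lon2) {
                double dLat = Math.toRadians(lat2 - lat1);
                double dLon = Math.toRadians(lon2 - lon1);
                double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                         + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                           * Math.sin(dLon / 2) * Math.sin(dLon / 2);
                return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
            }
        }

        // Keeps only the largest displacement seen for each person.
        public static class MaxDisplacementReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

            @Override
            protected void reduce(Text person, Iterable<DoubleWritable> displacements,
                                  Context context) throws IOException, InterruptedException {
                double max = Double.NEGATIVE_INFINITY;
                for (DoubleWritable d : displacements) {
                    max = Math.max(max, d.get());
                }
                context.write(person, new DoubleWritable(max));
            }
        }
    }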
Everything went well until I got stuck at a point where I was unable to understand how KeyIn and ValueIn are populated from an HDFS file read. How can I customize what goes into the key and what goes into the value? The Hadoop wiki states:

"It is not necessary for the InputFormat to generate both meaningful keys and values. For example, the default output from TextInputFormat consists of input lines as values and somewhat meaningless line start file offsets as keys - most applications only use the lines and ignores the offsets."

Hence, it depends on the specific RecordReader implementation. In the case of TextInputFormat, the LineRecordReader is used, which feeds the Mapper the (somewhat meaningless) byte offsets as LongWritable keys (Writable being Hadoop's serialization interface, and LongWritable its implementation for the long datatype) and the line contents as Text values. KeyValueLineRecordReader, used by KeyValueTextInputFormat (not in hadoop-core-0.20.2, but I can see it in the mapreduce trunk), reads the text file and separates key and value by the \t (tab) separator in each input line.
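So to get tab-separated keys and values instead of offsets, you swap the input format in the job driver. A minimal sketch, assuming the new-API KeyValueTextInputFormat from the mapreduce trunk mentioned above (the paths and job name are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class KeyValueDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "key-value example");
            job.setJarByClass(KeyValueDriver.class);

            // Each input line "someKey\tsomeValue" now arrives at the Mapper
            // as (Text someKey, Text someValue) instead of (offset, whole line).
            job.setInputFormatClass(KeyValueTextInputFormat.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The separator defaults to a tab but is configurable through a job property, so the same format can read other single-character-delimited lines as well.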