@InterfaceAudience.Public
@InterfaceStability.Stable
public interface InputFormat<K,V>
InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:

1. Validate the input-specification of the job.
2. Split-up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

Clearly, logical splits based on input size alone are insufficient for many applications, since record boundaries must be respected. In such cases, the application also has to implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.
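
The interaction between the size-based goal, the block-size upper bound, and the configurable lower bound can be sketched as a simple clamp. This is a minimal illustration of the rule described above, not the framework's actual code: the class name `SplitSizeSketch`, the method `computeSplitSize`, and all numeric values are assumptions chosen for the example.

```java
public class SplitSizeSketch {

    /**
     * Clamp the size-based goal between the configured minimum and the block size.
     * Hypothetical helper: goalSize is assumed to be totalSize / numSplits.
     */
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        // The block size caps the goal; the minimum split size is applied last.
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 1000L * 1024 * 1024;   // 1000 MiB of input (made-up figure)
        int numSplits = 4;                      // the hint passed to getSplits
        long blockSize = 128L * 1024 * 1024;    // 128 MiB block size (made-up figure)
        long minSize = 1;                       // assumed value of the minsize setting

        long goalSize = totalSize / numSplits;  // 250 MiB per split if unconstrained
        // The block size wins here because it is smaller than the goal.
        System.out.println(computeSplitSize(goalSize, minSize, blockSize)); // prints 134217728
    }
}
```

In this sketch the lower bound is applied after the upper bound, so a minimum larger than the block size would produce splits that span more than one block.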
See Also: InputSplit, RecordReader, JobClient, FileInputFormat
Modifier and Type | Method and Description |
---|---|
RecordReader<K,V> | getRecordReader(InputSplit split, JobConf job, Reporter reporter) Get the RecordReader for the given InputSplit. |
InputSplit[] | getSplits(JobConf job, int numSplits) Logically split the set of input files for the job. |
InputSplit[] getSplits(JobConf job, int numSplits) throws IOException
Logically split the set of input files for the job.

Each InputSplit is then assigned to an individual Mapper for processing.

Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For example, a split could be an <input-file-path, start, offset> tuple.

Parameters:
job - job configuration.
numSplits - the desired number of splits, a hint.
Returns:
an array of InputSplits for the job.
Throws:
IOException
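
To make the idea of a logical split concrete, the sketch below describes a file as a list of byte ranges without touching the file itself. It is an illustration only: `LogicalSplitSketch`, `Range`, the path `/data/input.txt`, and the sizes are hypothetical, and the third tuple element is modeled here as a length in bytes, which is an assumption about how the tuple mentioned in the note might be represented.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalSplitSketch {

    /** Hypothetical stand-in for an InputSplit: a byte range within one file. */
    public static final class Range {
        public final String path;
        public final long start;
        public final long length;

        Range(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    /** Describe a file of fileLength bytes as ranges of at most splitSize bytes. */
    public static List<Range> split(String path, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range may be shorter than splitSize.
            long length = Math.min(splitSize, fileLength - start);
            ranges.add(new Range(path, start, length));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // No file is read or written: the ranges are only descriptions.
        for (Range r : split("/data/input.txt", 250, 100)) {
            System.out.println(r);
        }
    }
}
```

For a 250-byte file and a 100-byte split size, this yields three ranges starting at 0, 100, and 200, the last one 50 bytes long.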
RecordReader<K,V> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException
Get the RecordReader for the given InputSplit.

It is the responsibility of the RecordReader to respect record boundaries while processing the logical split to present a record-oriented view to the individual task.

Parameters:
split - the InputSplit
job - the job that this split belongs to
Returns:
a RecordReader
Throws:
IOException
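
One common way for a reader to respect record boundaries when byte ranges cut through records is sketched below for newline-delimited text. The convention assumed here is that a reader whose range does not begin at offset 0 discards everything up to and including the first newline, and every reader keeps reading any record that starts at or before the end of its range. The class `RecordBoundarySketch` and the method `readRecords` are hypothetical and work on an in-memory byte array; this is not presented as the behavior of any particular RecordReader implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RecordBoundarySketch {

    /** Return the newline-delimited records owned by the range [start, start + length). */
    public static List<String> readRecords(byte[] data, int start, int length) {
        int end = Math.min(start + length, data.length);
        int pos = start;

        if (start != 0) {
            // Skip through the first newline: the previous range owns that record.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline itself
        }

        List<String> records = new ArrayList<>();
        // A record belongs to this range if it starts at or before 'end',
        // even when its bytes extend past the end of the range.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            records.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Three 10-byte ranges cut through "bravo" and "charlie".
        System.out.println(readRecords(data, 0, 10));  // prints [alpha, bravo]
        System.out.println(readRecords(data, 10, 10)); // prints [charlie, delta]
        System.out.println(readRecords(data, 20, 6));  // prints []
    }
}
```

Each record is returned exactly once across the three ranges, although none of the range boundaries coincides with a record boundary.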
Copyright © 2024 Apache Software Foundation. All rights reserved.