Design Refinements in MapReduce: Part I
Let's introduce some execution-related improvements to MapReduce's design.
Real-world systems are rarely designed in one go—it often takes many iterations to improve the design. As initial versions of our system are deployed in production, we get usage data and possibly new insights. In this and the next lesson, we will improve many aspects of MapReduce design.
We will order our refinements to follow the system's execution flow.
Input and output types
Let’s analyze the input and output types supported by the MapReduce library.
Input types
By default, the MapReduce library supports reading a small set of common input data types. Each input type's implementation automatically splits the data into meaningful ranges for further processing by the Map tasks.
Example
As we know, the data is parsed into key-value pairs before the Map tasks process it. The “text” mode input processes each line as a key-value pair, such that:
- The key is the offset of the line in the input file.
- The value is the content of that line.
This mode ensures that the partitioning happens only at the line boundaries.
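As a sketch of this behavior (an in-memory illustration, not the actual library code), a text-mode reader might produce pairs like this:

```python
def text_reader(data: str):
    """Yield (key, value) pairs in "text" mode: the key is the
    offset where each line starts, the value is the line's content.
    Splitting happens only at line boundaries."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

pairs = list(text_reader("to be\nor not\nto be\n"))
# Each key is the offset at which that line begins in the input.
```

Because every pair covers a whole line, a Map task handed any range of these pairs never sees a line cut in half.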
Support for new input types
Users can also implement a new reader interface to add support for a new input type. For example, we can define a reader that reads records from a database or from a memory-mapped data structure.
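To make the idea concrete, here is a minimal sketch of such a reader interface. The names (`InputReader`, `DictReader`) are assumptions for illustration, not part of the MapReduce library; the "database" is stood in for by an in-memory mapping.

```python
from abc import ABC, abstractmethod

class InputReader(ABC):
    """Hypothetical reader interface: each implementation yields
    (key, value) pairs for the Map tasks from some data source."""
    @abstractmethod
    def read(self):
        ...

class DictReader(InputReader):
    """Toy 'database' reader: treats an in-memory mapping as the
    source and emits its records directly as key-value pairs."""
    def __init__(self, records):
        self.records = records

    def read(self):
        yield from self.records.items()

reader = DictReader({"doc1": "hello world", "doc2": "goodbye"})
records = list(reader.read())
```

Any source that can be iterated as key-value records, such as a database table or a memory-mapped structure, fits behind the same interface.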
Output types
The MapReduce library also supports a set of output types by default and, as with the input types, lets users define new output types.
Using custom data types is a powerful extension that enables end programmers to read data from many different sources and write it to many different sinks.
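Symmetrically to the reader, a custom output type can be sketched as a writer interface. The names (`OutputWriter`, `TsvWriter`) are hypothetical, and an in-memory buffer stands in for the real sink:

```python
class OutputWriter:
    """Hypothetical writer interface: each implementation consumes
    (key, value) pairs produced by the Reduce tasks and delivers
    them to some sink."""
    def write(self, key, value):
        raise NotImplementedError

class TsvWriter(OutputWriter):
    """Toy writer that formats each pair as a tab-separated line
    and appends it to an in-memory buffer (a stand-in for a file
    or database sink)."""
    def __init__(self):
        self.lines = []

    def write(self, key, value):
        self.lines.append(f"{key}\t{value}")

writer = TsvWriter()
writer.write("word", 3)
writer.write("count", 7)
```

Swapping `TsvWriter` for, say, a database-backed writer changes only where the pairs land, not how Reduce tasks produce them.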
Partitioning function
The partitioning function distributes the intermediate data across the user-defined partitions. By default, the MapReduce library provides a ...