As technology advances, the number of applications and services on the market keeps growing. These applications run on data, either collected from their users or drawn from publicly available sources. Big tech giants such as Meta own multiple applications, including Facebook, Instagram, and Threads. These social media platforms have amassed huge audiences, which results in enormous datasets. On average, Facebook generates four new petabytes of data daily, and this figure keeps rising as its user base grows.
By processing this data, companies can extract many different kinds of insights from it. This need has given rise to data analysts, who specialize in understanding such data and extracting key insights about specific aspects of a business. The challenge, however, lies in processing these massive volumes of data quickly and efficiently. This is where the MapReduce programming model comes in.
MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. Using MapReduce, we can split petabytes of data into smaller chunks and process them in parallel. The whole process rests on two main tasks, mapping and reducing, and the programming model relies heavily on key-value pairs.
Mapping: This process takes input in the form of key-value pairs and, after processing it, produces another set of intermediate key-value pairs.
Reducing: This process takes the output of the map task and aggregates it into a smaller, more digestible set of results. The outcome is still in the form of key-value pairs. Both phases are illustrated in the sketch that follows.
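To make these two phases concrete before we bring in Hadoop, below is a toy, framework-free sketch of the word count flow in plain Java. The input lines, class name, and variable names are purely illustrative; in a real MapReduce job, the two phases run in parallel across many machines, and the framework performs the grouping (shuffle) step between them automatically.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
  public static void main(String[] args) {
    String[] inputLines = {"deer bear river", "car car river", "deer car bear"};

    // Map phase: turn each input line into intermediate (word, 1) pairs.
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String line : inputLines) {
      for (String word : line.split("\\s+")) {
        intermediate.add(new SimpleEntry<>(word, 1));
      }
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    Map<String, Integer> counts = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
    }

    System.out.println(counts); // prints {bear=2, car=3, deer=2, river=2}
  }
}

Notice that both the intermediate results and the final output are key-value pairs, exactly as described above.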
MapReduce is a vital part of the data pipelines at many tech companies, supporting a wide range of domains and applications. These include the following:
Entertainment: MapReduce can help identify the most popular movies by analyzing users' preferences and viewing history, drawing on their logs and clicks.
E-commerce: Major e-commerce providers like Amazon, Walmart, and eBay employ the MapReduce programming model to identify customers' favorite items based on their preferences and buying behavior.
Data Warehouse: MapReduce analyzes large volumes of data in data warehouses while applying specific business rules to gain valuable insights.
Fraud Detection: MapReduce can detect fraud by identifying patterns and analyzing business metrics through transaction analysis.
There are many instances where we can use MapReduce to distill large datasets into concise results. These include programs such as:
Finding mutual friends on Facebook
Finding the highest recorded global temperature for each year in a dataset (a sketch of this one follows the list)
Analyzing customer purchase behavior within a specific application
Finding the counts of each word in a dataset
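As a quick illustration of the second program above, here is a rough sketch of a max-temperature mapper and reducer. It assumes, purely for illustration, that each input record is a comma-separated year,temperature line; the class names (MaxTemperature, YearMapper, MaxReducer) are invented for this example.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {
  // Emits one (year, temperature) pair per input record.
  public static class YearMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      if (parts.length == 2) { // skip malformed records in this sketch
        context.write(new Text(parts[0]), new DoubleWritable(Double.parseDouble(parts[1])));
      }
    }
  }

  // Keeps only the largest temperature seen for each year.
  public static class MaxReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
      double max = Double.NEGATIVE_INFINITY;
      for (DoubleWritable val : values) {
        max = Math.max(max, val.get());
      }
      context.write(key, new DoubleWritable(max));
    }
  }
}

The only changes from word count are the value type and the reduce operation (a max instead of a sum), which shows how naturally different problems fit the same model.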
Now that we have a basic understanding of MapReduce, let's look at the code for a word count program that uses it, starting with the mapper.
public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\\s+");
    for (String wordStr : words) {
      word.set(wordStr.trim());
      if (!word.toString().isEmpty()) {
        context.write(word, one);
      }
    }
  }
}
Explanation
Lines 2–3: These lines declare the variables used by the mapper: a reusable IntWritable holding the constant count of 1 and a reusable Text object for the current word.
Line 4: This line defines the map method, which receives the line's byte offset as the key and the line's contents as the value.
Line 5: This line converts the Text value into a String and assigns it to the line variable.
Line 6: This line splits the data in the line variable on whitespace, producing the words array.
Lines 7–12: This part of the code iterates over the entries of the words array.
Line 8: This line trims the current token and sets it as the value of the word variable.
Line 9: This line checks that word is not empty, so stray whitespace does not produce empty keys.
Line 10: This line emits the intermediate key-value pair, where the key is the word and the value is one (a count of 1).
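For example, given the hypothetical input line deer bear deer, the mapper would emit the intermediate pairs (deer, 1), (bear, 1), and (deer, 1). Note that duplicate words are not combined here; that aggregation is the reducer's job.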
public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
Explanation
Line 6: This line initializes the running sum to zero.
Line 7: This line starts a loop that iterates over each value associated with the input key.
Line 8: This line adds the value to the running sum, incrementing the count.
Line 10: This line stores the final sum in the result variable.
Line 11: This line emits the final key-value pair, where the key is the word and the value is its total count.
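Continuing the earlier example, after the framework groups the intermediate pairs by key, the reducer receives (deer, [1, 1]) and (bear, [1]) and emits the final pairs (deer, 2) and (bear, 1).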
Finally, here is the complete program, combining the mapper and the reducer with the driver code that configures and launches the job:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Splits each input line into words and emits (word, 1) for every word.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split("\\s+");
      for (String wordStr : words) {
        word.set(wordStr.trim());
        if (!word.toString().isEmpty()) {
          context.write(word, one);
        }
      }
    }
  }

  // Sums the counts for each word and emits (word, total).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and blocks until it completes.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
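Assuming a working Hadoop installation and input text files already uploaded to the /input directory in HDFS, the job can be compiled, packaged, and launched roughly as follows (these commands follow the Apache Hadoop tutorial; adjust paths and classpath for your setup):

hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hadoop jar wc.jar WordCount

Once the job finishes, the word counts appear under /output, typically in a file named part-r-00000, which can be inspected with hadoop fs -cat /output/part-r-00000.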
In this Answer, we discussed what MapReduce is, how it functions, and why it matters in today's tech world, and we walked through a word count program built with it. We looked in detail at the individual mapper and reducer programs and the role each plays in the overall process.