MapReduce word count program in Java

As technology advances, the number of applications and software products on the market keeps growing. These applications run on data: they either collect it from their users or draw on other publicly available sources. Big tech giants such as Meta own multiple applications, including Facebook, Instagram, and Threads. These social media platforms have amassed huge audiences, which results in even more extensive datasets. On average, Facebook generates four new petabytes of data daily, and this figure grows as its user base grows.

From this data, companies can extract different kinds of information, a practice called data processing. It has given rise to data analysts, who specialize in understanding this data and extracting key insights about specific aspects of it. However, the challenge arises when it comes to processing this massive amount of data with speed and efficiency. This is where the MapReduce programming model comes in.

What is MapReduce?

MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. Using MapReduce, we can split petabytes of data into chunks and process those chunks in parallel. The whole process is built on two main tasks: mapping and reducing. This programming model relies heavily on key-value pairs for processing.

  • Mapping: This process takes an input in the form of key-value pairs and produces another set of intermediate key-value pairs after processing the input.

  • Reducing: This process takes the output from the map task and aggregates it into a smaller, more readable set of results. The outcome is still in the form of key-value pairs.

Basic MapReduce outline.
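To make the key-value flow concrete, here is a minimal, hypothetical sketch in plain Java (no Hadoop required, and not part of the program we build below) that imitates the two phases on a tiny input: the map step emits a (word, 1) pair for every word, and the reduce step sums the values grouped under each key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MiniMapReduce {
    // Map phase: emit an intermediate (word, 1) pair for every word in the input.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : line.split("\\s+")) {
            if (!w.isEmpty()) {
                pairs.add(Map.entry(w, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: group the intermediate pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("to be or not to be"));
        System.out.println(counts.get("to")); // prints 2
    }
}
```

In the real framework, the grouping done here by `HashMap` is handled by the shuffle phase between the map and reduce tasks, and the work is spread across many machines.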

Applications of MapReduce

MapReduce is a vital part of data processing at many tech companies, supporting analysis and classification tasks across specific domains and applications. These include the following:

  • Entertainment: Streaming platforms can use MapReduce to identify the most popular movies by analyzing users' preferences and viewing history through their logs and clicks.

  • E-commerce: Major e-commerce providers like Amazon, Walmart, and eBay employ the MapReduce programming model to identify customers' favorite items based on their preferences and buying behavior.

  • Data Warehouse: MapReduce analyzes large volumes of data in data warehouses while applying specific business rules to gain valuable insights.

  • Fraud Detection: MapReduce can detect fraud by identifying patterns and analyzing business metrics through transaction analysis.

Examples of MapReduce

There are many instances where we can use MapReduce to simplify our data. These can be programs such as:

  • Finding mutual friends on Facebook

  • Finding the highest recorded global temperature for each year in a dataset

  • Analyzing customer purchase behavior of a specific application 

  • Finding the counts of the words in a data set

Now that we have a basic understanding of MapReduce, let's look at the code for the word count program using MapReduce.

Word count program example

Mapper

public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] words = line.split("\\s+");
    for (String wordStr : words) {
      word.set(wordStr.trim());
      if (!word.toString().isEmpty()) {
        context.write(word, one);
      }
    }
  }
}

Explanation

  • Line 2—3: These lines declare the reusable objects for the mapper: one holds the constant count of 1, and word holds the current word.

  • Line 6: This line converts the Text value into a String and assigns it to the line variable.

  • Line 7: This line splits the data in the line variable using whitespace as a delimiter (a character or sequence of characters that separates text into distinct parts) and assigns the result to an array variable named words.

  • Line 8—13: This part of the code iterates over the entries in the words array.

  • Line 9: This line trims the current word and sets it as the value of the word variable.

  • Line 10: This line checks that the word is not empty.

  • Line 11: This line emits the intermediate key-value pair, where the key is the word and the value is one (a count of 1).
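As a quick check of the tokenization step above, this small standalone snippet (an illustration only, not part of the Hadoop job) shows what splitting on "\\s+" produces for a sample line:

```java
public class SplitDemo {
    // Split on runs of whitespace, exactly as the mapper does.
    static String[] tokenize(String line) {
        return line.split("\\s+");
    }

    public static void main(String[] args) {
        String[] words = tokenize("Hello   Hadoop\tHello");
        // Runs of spaces and tabs count as one delimiter,
        // so this yields three tokens: ["Hello", "Hadoop", "Hello"]
        System.out.println(words.length); // prints 3
    }
}
```

Note that a line with leading whitespace would produce an empty first element, which is why the mapper guards each emit with the isEmpty() check.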

Reducer

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Explanation

  • Line 2: This line declares the result variable that holds the final count for each word.

  • Line 5: This line initializes the running sum to zero.

  • Line 6: This line starts a loop to iterate over each value associated with the input key.

  • Line 7: This line adds the value to the current sum, incrementing the count.

  • Line 9: This line sets the result variable to the final count of the word.

  • Line 10: This line emits the final key-value pair, where the key is the word and the value is the total count of that word.
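After the shuffle phase, each call to reduce receives one word together with all of its 1s. The hypothetical standalone snippet below mimics that summation with plain Java types instead of Hadoop's IntWritable:

```java
import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    // Sum all counts grouped under a single key, as IntSumReducer does.
    static int sumCounts(Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The shuffle delivers e.g. ("hello", [1, 1, 1]) to one reduce call.
        List<Integer> values = Arrays.asList(1, 1, 1);
        System.out.println(sumCounts(values)); // prints 3
    }
}
```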

Complete code

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      String[] words = line.split("\\s+");
      for (String wordStr : words) {
        word.set(wordStr.trim());
        if (!word.toString().isEmpty()) {
          context.write(word, one);
        }
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Summary

In this Answer, we discussed MapReduce, how it works, its importance in the tech world, and the example of a word count program built with MapReduce. We looked at the mapper and reducer programs individually and saw how each contributes to the overall process.

Copyright ©2024 Educative, Inc. All rights reserved