Modify the WordCount program so that it outputs the word count for each distinct word in each file. The output of this DocWordCount program should be of the form ‘wordfilename count’, where ‘’ serves as a delimiter between word and filename and a tab serves as a delimiter between filename and count. Submit your source code in a file named DocWordCount.java.

Explanation: Consider two simple files, file1.txt and file2.txt.

$ echo "Hadoop is yellow Hadoop" > file1.txt
$ echo "yellow Hadoop is an elephant" > file2.txt

Running DocWordCount.java on these two files will give output similar to that below, where a delimiter separates each word from its filename.

Output of DocWordCount.java

yellowfile2.txt 1
Hadoopfile2.txt 1
isfile2.txt 1
elephantfile2.txt 1
yellowfile1.txt 1
Hadoopfile1.txt 2
isfile1.txt 1
anfile2.txt 1
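
The output above amounts to counting on a composite key built from the word and the name of the file it came from. As a minimal sketch without Hadoop, the plain-Java class below reproduces those counts for the two sample files. The class name `DocWordCountSketch`, the method `countFile`, and the `DELIMITER` value `#####` are assumptions made for illustration only; the assignment's actual delimiter is not shown in the text above. In a Hadoop mapper, the file name would typically come from the input split, for example `((FileSplit) context.getInputSplit()).getPath().getName()`, rather than being passed in as an argument.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;

public class DocWordCountSketch {
    // Assumed placeholder; the assignment's real delimiter is not shown above.
    public static final String DELIMITER = "#####";

    // Same tokenizing pattern as the WordCount mapper.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    // Adds one file's counts to `counts`, keyed by word + DELIMITER + fileName.
    public static void countFile(String fileName, List<String> lines,
                                 Map<String, Integer> counts) {
        for (String line : lines) {
            for (String word : WORD_BOUNDARY.split(line)) {
                if (word.isEmpty()) {
                    continue; // skip empty pieces produced by the split
                }
                counts.merge(word + DELIMITER + fileName, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countFile("file1.txt", Arrays.asList("Hadoop is yellow Hadoop"), counts);
        countFile("file2.txt", Arrays.asList("yellow Hadoop is an elephant"), counts);
        // A tab separates the composite key from the count, as the task requires.
        counts.forEach((key, n) -> System.out.println(key + "\t" + n));
    }
}
```

The sketch prints its lines in sorted key order, which may differ from the order a real MapReduce job emits; the task only asks for output similar to the sample.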

Initial code that needs to be modified:

package org.myorg;

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WordCount extends Configured implements Tool {

  private static final Logger LOG = Logger.getLogger(WordCount.class);

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new WordCount(), args);
    System.exit(res);
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "wordcount");
    job.setJarByClass(this.getClass());

    // args[0]: comma-separated input paths; args[1]: output directory
    FileInputFormat.addInputPaths(job, args[0]);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  // Emits (word, 1) for every non-empty token in each input line.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    @Override
    public void map(LongWritable offset, Text lineText, Context context)
        throws IOException, InterruptedException {

      String line = lineText.toString();
      Text currentWord = new Text();

      for (String word : WORD_BOUNDARY.split(line)) {
        if (word.isEmpty()) {
          continue;
        }
        currentWord = new Text(word);
        context.write(currentWord, one);
      }
    }
  }

  // Sums the counts emitted for each key and writes (word, total).
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }
}
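
The mapper's `WORD_BOUNDARY` pattern splits at word boundaries and swallows any surrounding whitespace. Because some of those matches are zero-width, the split result can contain empty strings, which is why the loop skips empty tokens. The standalone sketch below mirrors that loop outside Hadoop; the class name `TokenizeSketch` and the method `tokens` are hypothetical names introduced for illustration and are not part of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TokenizeSketch {
    // Same pattern as in the WordCount mapper.
    private static final Pattern WORD_BOUNDARY = Pattern.compile("\\s*\\b\\s*");

    // Returns the non-empty tokens of a line, mirroring the mapper's loop.
    public static List<String> tokens(String line) {
        List<String> result = new ArrayList<>();
        for (String word : WORD_BOUNDARY.split(line)) {
            if (word.isEmpty()) {
                continue; // zero-width boundary matches can leave empty pieces
            }
            result.add(word);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [Hadoop, is, yellow, Hadoop]
        System.out.println(tokens("Hadoop is yellow Hadoop"));
    }
}
```

Tokens are compared case-sensitively, so `Hadoop` and `hadoop` would be counted as different words unless the mapper normalizes case.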
