Installation and Testing of the mrjob Library

The task is to run the example script seen in class, which counts the words in a book using the mrjob library. You will work locally on your own machine, not on the Hadoop cluster.

Start by installing this library with the pip command:

pip install mrjob
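To confirm that the installation succeeded, you can ask Python which version of the package it sees. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is only illustrative.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "mrjob" is the package name used in the pip command above.
    version = installed_version("mrjob")
    if version is None:
        print("mrjob is not installed in this environment")
    else:
        print("mrjob version:", version)
```

If the script reports that mrjob is missing, check that pip and python refer to the same Python environment.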

Create a directory on your hard drive and place in it the two scripts seen in class (wc_mrjob_1.py and wc_mrjob_2.py) along with the dracula file.

Then test the first script with:

python wc_mrjob_1.py < dracula > resultInline.txt

Check with the command cat resultInline.txt that the file contains the expected result. The symbols < and > redirect standard input and standard output, respectively: the script reads the dracula file and writes its result to resultInline.txt. This command is equivalent to:
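Beyond reading the file with cat, the result can also be inspected from Python. The sketch below assumes that the script keeps mrjob's default output format, where each line holds a JSON-encoded key and a JSON-encoded value separated by a tab; if the scripts from class set another output protocol, the parsing must be adapted. The function names are illustrative, not part of mrjob.

```python
import json


def parse_output_line(line):
    """Split one output line into (key, value), decoding each side as JSON."""
    raw_key, raw_value = line.rstrip("\n").split("\t", 1)
    return json.loads(raw_key), json.loads(raw_value)


def load_counts(lines):
    """Build a {word: count} dictionary from an iterable of output lines."""
    counts = {}
    for line in lines:
        if line.strip():  # skip blank lines
            word, count = parse_output_line(line)
            counts[word] = count
    return counts


def top_words(counts, n=10):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

For example, calling load_counts on the open resultInline.txt file and passing the result to top_words gives a quick sanity check: in an English novel, the most frequent words should be common ones such as "the" and "and".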

python wc_mrjob_1.py -r inline < dracula > resultInline.txt

In inline mode, the job runs in a single process, which is convenient for testing and debugging.
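To make the idea of a single-process run concrete, here is a plain Python sketch of the three phases of a word count (map, sort, reduce) executed one after the other in the same process. It does not use mrjob and is not how mrjob is implemented internally; the mapper, the reducer and the tokenization rule are assumptions and may differ from those in wc_mrjob_1.py.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumed tokenization: runs of letters, digits, underscores and apostrophes.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Emit (word, 1) for every word found in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Emit the total number of occurrences of one word."""
    yield word, sum(counts)


def run_in_one_process(lines):
    """Run the map, sort and reduce phases sequentially."""
    # Map phase: apply the mapper to every input line.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Sort phase: bring identical keys next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: call the reducer once per distinct key.
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(word, (count for _, count in group)))
    return results
```

Because everything happens in one process, a print statement or a debugger breakpoint placed in the mapper or the reducer behaves as in any ordinary Python program.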

To run the job in several concurrent subprocesses (which can use different cores of your processor), use the local mode:

python wc_mrjob_1.py -r local < dracula > resultLocal.txt
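Both modes should produce the same counts, although the lines may not appear in the same order in the two files. A small sketch for comparing two result files regardless of line order follows; the function name same_records is illustrative.

```python
def same_records(lines_a, lines_b):
    """Return True if both iterables hold the same lines, ignoring order."""

    def normalize(lines):
        # Drop blank lines and trailing newlines, then sort for comparison.
        return sorted(line.rstrip("\n") for line in lines if line.strip())

    return normalize(lines_a) == normalize(lines_b)
```

For example, open resultInline.txt and resultLocal.txt, pass both file objects to same_records, and check that it returns True.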

Now test the wc_mrjob_2.py script in the same way.