Installation and Testing of the mrjob Library
The task here is to run the example script seen in class, which counts the words in a book using the mrjob library. You will work locally on your own machine, not on the Hadoop cluster.
Start by installing the library with the pip command:
pip install mrjob
Create a directory on your hard drive and place the two scripts seen in class in it, along with the dracula file.
Then test the first script with:
python wc_mrjob_1.py < dracula > resultInline.txt
Check, with the cat resultInline.txt command, that the file contains the expected result. The symbols < and > redirect the input and the output, respectively.
The test command above is equivalent to:
python wc_mrjob_1.py -r inline < dracula > resultInline.txt
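The -r inline option selects the inline runner, which runs the whole job in a single process and is convenient for testing and debugging. It is the default runner, which is why the two commands are equivalent.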
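To make the single-process picture concrete, here is a minimal pure-Python sketch of the three phases a word-count job goes through (map, shuffle, reduce), all executed sequentially in one process. It does not use mrjob and is not how the library is implemented. The names `mapper`, `reducer`, and `run_inline`, the lowercasing, and the tokenizing regular expression are assumptions made for illustration and may differ from the scripts seen in class.

```python
import re
from collections import defaultdict

# Assumed tokenizer: runs of word characters and apostrophes count as words.
WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the line."""
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_inline(lines):
    """Run map, shuffle, and reduce sequentially in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            grouped[word].append(count)  # shuffle: group values by key
    # Reduce each key once, visiting the keys in sorted order.
    return dict(reducer(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    sample = ["the count and the castle", "The castle door"]
    print(run_inline(sample))
```

In a real run, the framework handles the shuffle step and the input and output; the scripts only have to define the map and reduce logic.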
To run the job in several concurrent subprocesses, which can use different cores of your processor, use the local mode:
python wc_mrjob_1.py -r local < dracula > resultLocal.txt
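Both runners should produce the same counts, although the lines of the two result files may appear in a different order. The sketch below compares two outputs regardless of line order. It assumes the scripts keep mrjob's default output format (a JSON-encoded key and a JSON-encoded value on each line, separated by a tab); if the class scripts set another output protocol, the parsing has to be adapted. The helper names `parse_counts` and `same_counts` are hypothetical.

```python
import json


def parse_counts(lines):
    """Parse lines of the form '"word"<TAB>count' into a dict."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        counts[json.loads(key)] = json.loads(value)
    return counts


def same_counts(lines_a, lines_b):
    """Return True if both outputs hold the same counts, in any order."""
    return parse_counts(lines_a) == parse_counts(lines_b)


if __name__ == "__main__":
    # In-memory stand-ins for the two result files (assumed format).
    inline_out = ['"castle"\t2\n', '"the"\t3\n']
    local_out = ['"the"\t3\n', '"castle"\t2\n']
    print(same_counts(inline_out, local_out))
```

To compare the real files, open resultInline.txt and resultLocal.txt with `open(...)` and pass the two file objects to `same_counts`, since a text file object iterates over its lines.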
Now, test the wc_mrjob_2.py script.