Map-reduce, stand-alone mode

Here we run the map-reduce algorithm that counts the words in a text file locally, i.e., without yet using the parallelism offered by the Hadoop framework. The program consists of two Python scripts that are called in sequence, as described below. The focus is mainly on understanding the logic of the algorithm.
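To make the logic concrete before running anything, here is a minimal in-memory sketch of the three phases of a word count: map, group by key, reduce. The function names (map_words, shuffle, reduce_counts, word_count) are hypothetical and chosen for illustration only; the lab's wc_mapper.py and wc_reducer.py may be organized differently.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit one (word, 1) pair per whitespace-separated token."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Grouping step: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, values):
    """Reduce step: sum the 1s emitted for a single word."""
    return word, sum(values)


def word_count(lines):
    """Run the three steps one after the other on a list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(word, values) for word, values in shuffle(pairs).items())
```

For example, word_count(["the cat saw the dog", "the dog ran"]) counts "the" three times and "dog" twice. In the commands below, the grouping step is played by sort rather than by a dictionary.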

Note: You should know how to open a Terminal on your machine, regardless of the operating system. On Windows 10 (and later versions), you can use the PowerShell program (installed by default), which is very similar to the Terminal on Linux and macOS. There are video tutorials available to learn the basic commands, including this one. Finally, note that PowerShell includes git, wget, and ssh: three tools we will be using during the labs.


Setup

To retrieve the scripts, create a working directory and download the two Python scripts as well as the Dracula book in text format into this directory:

Execution of the Python scripts

Note for Windows users: If you are using Windows, replace python with python.exe everywhere in these instructions. If the python.exe command doesn’t work, the location of the python.exe program is not known to your machine, and you need to add it to the PATH environment variable. You can follow the instructions given in the section “Method 2: Manually add Python to Windows Path” of this link.

  • Open a Terminal and navigate to your working directory (using the cd command).

  • Run the following command and observe the result:

cat dracula | python wc_mapper.py

  • Then run the next command and observe the result:

cat dracula | python wc_mapper.py | sort

  • Finally, run the full command and observe the result:

cat dracula | python wc_mapper.py | sort | python wc_reducer.py

It is possible to redirect the output to a file, rather than to the screen:

cat dracula | python wc_mapper.py | sort | python wc_reducer.py > result.txt
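The following sketch suggests why sort sits between the two scripts. It assumes, without confirmation from the lab files, that the mapper writes one "word<TAB>1" record per token (a common Hadoop Streaming convention) and that the reducer sums counts over consecutive records with the same word. The mapper and reducer functions here are hypothetical stand-ins, not the contents of wc_mapper.py and wc_reducer.py.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token (assumed record format)."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(records):
    """Sum counts over runs of consecutive records that share a word."""
    parsed = (record.split("\t") for record in records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


text = ["b a b", "a b"]
records = list(mapper(text))

# Without sorting, equal words are not adjacent, so each run is counted separately.
unsorted_result = list(reducer(records))

# Sorting brings identical words together, so each word is reduced exactly once.
sorted_result = list(reducer(sorted(records)))
```

Under these assumptions, sorted_result contains one total per word, while unsorted_result contains a separate entry for every run of identical words. A reducer written this way only needs to remember the current word, which is why it depends on sorted input.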

Important note: The first line of all your Python scripts must be

#!/usr/bin/env python3

This line indicates that, when the script is executed directly, it should be run with python3. To check that this line is present in the scripts used above, type, for example:

more wc_mapper.py

Press the q key on your keyboard to quit the more command.

Improving the Scripts

By looking carefully at the output of the full word-count command above, we notice two “issues”:

  • The case of the words is taken into account. For example, the words “Youth” and “youth” are considered as two different words.

  • Punctuation marks are also taken into account. For example, the words “Youth”, “Youth,” and “Youth.” are considered as three different words.

Copy the two scripts into two new files:

cp wc_mapper.py wc_mapper_improved.py
cp wc_reducer.py wc_reducer_improved.py

and modify these new files so that the word counting is no longer sensitive to these two issues.
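As a starting point, here is one possible way to normalize a token before counting it. This is a sketch rather than the expected solution: the helper names normalize and tokens are hypothetical, and it only strips the ASCII punctuation listed in string.punctuation from the two ends of each token, so typographic quotes such as “ ” and punctuation inside a word are left untouched.

```python
import string


def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.lower().strip(string.punctuation)


def tokens(line):
    """Yield the normalized, non-empty words of one line of text."""
    for raw in line.split():
        word = normalize(raw)
        if word:  # skip tokens made only of punctuation, such as "--"
            yield word
```

With these helpers, “Youth”, “youth.” and “Youth,” all become the same key, youth. Where the normalization belongs (most naturally in the mapper, before the word is emitted) and how to handle apostrophes or hyphens are choices left to you.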