Map-Reduce with Hadoop

Building on the previous installation, we are going to take advantage of your processor's parallelism: a typical machine has 8 cores and can therefore run 8 threads in parallel. Of these 8 cores, we will use only 3 (1 for the Namenode and 2 for the Datanodes), leaving the others free for the machine's other tasks.
First, run the following command:
/usr/local/hadoop/bin/hdfs namenode -format
This formats the HDFS filesystem. This command should only be executed once, when the Namenode is first set up (re-running it would wipe the existing HDFS metadata).
Starting the Hadoop Daemons

The first thing to do in the Terminal connected to hadoop-master is to start the Hadoop daemons:
start-dfs.sh
start-yarn.sh
Running these scripts produces output similar to this:
Starting namenodes on [hadoop-master]
hadoop-master: Warning: Permanently added 'hadoop-master' (ED25519) to the list of known hosts.
Starting datanodes
localhost: Warning: Permanently added 'localhost' (ED25519) to the list of known hosts.
Starting secondary namenodes [hadoop-master]
hadoop-master: Warning: Permanently added 'hadoop-master' (ED25519) to the list of known hosts.
2025-01-25 08:16:07,545 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers
Preparing files for Wordcount

Important Notes

- The Terminal points to a Linux account with its own filesystem (ext3). You can create directories, add files, and delete them using standard Linux commands (mkdir, rm…), but note that the container we have just installed has no built-in text editor (for writing Python scripts). We will therefore use a trick described below.
- This space is where we will store the Python map-reduce scripts that Hadoop will execute. Large files, on the other hand (those on which we deploy processing algorithms), are stored on a part of your hard drive managed by HDFS (Hadoop Distributed File System). With commands of the form "hadoop fs -" followed by a command name, you can create directories on HDFS, copy files from Linux to HDFS, and bring files from HDFS back to Linux.
Let yourself be guided…
- From the Terminal, navigate to the following directory:
cd TP_Hadoop/wordcount/
- The ls command will show that the directory contains, among other things:
  - The file dracula: contains the book Dracula in text format.
  - The Python scripts: wc_mapper.py and wc_reducer.py.
- Place the dracula file into HDFS (after creating a directory to receive it):
hadoop fs -mkdir -p input
hadoop fs -put dracula input
Check that the file has been uploaded:
hadoop fs -ls input
You should see something like this:
Found 1 items
-rw-r--r-- 2 root supergroup 844505 2025-01-25 08:18 input/dracula
Now, we are ready to run our first map-reduce script on Hadoop!
Wordcount with Hadoop
From the first Terminal, we will run the scripts that count the number of words in the Dracula book file.
- First, store the path to the library that enables writing Hadoop jobs in Python in an environment variable:
export STREAMINGJAR='/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.4.1.jar'
Remark: Hadoop MapReduce runs on Java, so we need a library that lets programs written in other languages (such as Python) act as mappers and reducers by exchanging data through standard input and output. This is the role of the hadoop-streaming-3.x.x.jar library (known as a wrapper).
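A streaming mapper and reducer are ordinary programs that read lines from standard input and write lines to standard output. The provided scripts are not reproduced here, but a minimal sketch of the logic that wc_mapper.py and wc_reducer.py typically contain (an assumption — your scripts may differ in detail) looks like this:

```python
from itertools import groupby

def map_words(lines):
    """Mapper logic: emit one 'word<TAB>1' pair per word."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reduce_counts(pairs):
    """Reducer logic: pairs arrive sorted by key; sum the counts per word."""
    split = (p.split("\t", 1) for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

if __name__ == "__main__":
    # In a real streaming job, the mapper reads sys.stdin and the reducer
    # reads the mapper output after Hadoop's shuffle-and-sort phase;
    # here we simulate that phase locally with sorted().
    sample = ["the count and the castle", "the count"]
    for line in reduce_counts(sorted(map_words(sample))):
        print(line)
```

Note the key point this illustrates: the reducer can use groupby only because Hadoop guarantees its input is sorted by key.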
- Next, run the Hadoop job with the following command (copy the entire block of instructions and paste it into the terminal):
hadoop jar $STREAMINGJAR -files wc_mapper.py,wc_reducer.py \
-mapper wc_mapper.py -reducer wc_reducer.py \
-input input/dracula -output sortie
The -files option copies the listed files to every node of the cluster so they can be executed there.
If the command doesn’t work properly, you can try this one:
hadoop jar $STREAMINGJAR -files /root/wordcount/wc_mapper.py,/root/wordcount/wc_reducer.py \
-mapper "python wc_mapper.py" -reducer "python wc_reducer.py" \
-input input/dracula -output sortie
The result of the word count is stored in the sortie folder under HDFS. You can view its content by running the command:
hadoop fs -ls sortie/
This will give you something like:
Found 2 items
-rw-r--r-- 2 root supergroup 0 2025-01-25 08:20 sortie/_SUCCESS
-rw-r--r-- 2 root supergroup 236411 2025-01-25 08:20 sortie/part-00000
The first file, _SUCCESS, is an empty file (0 bytes!), and its mere presence indicates that the job completed successfully. The second file, part-00000, contains the result of the algorithm. You can view the last few lines of the file with the following command:
hadoop fs -tail sortie/part-00000
or view the entire file with:
hadoop fs -cat sortie/part-00000
The result should be exactly the same as in the first part of the TP.
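To make that comparison easier, you can copy the result back to Linux with hadoop fs -get sortie/part-00000 . and load it in Python. A small sketch (load_counts is a hypothetical helper, not part of the TP material):

```python
def load_counts(path):
    """Parse a streaming wordcount output file: one 'word<TAB>count' line per word."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            counts[word] = int(count)
    return counts
```

Loading both results into dictionaries lets you compare them with a simple equality test, independent of line ordering.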
Important note: Between two runs, either use a new name for the sortie folder or delete it with the following command:
hadoop fs -rm -r -f sortie
The presence of only one file part-0000x
indicates that a single reducer task was used (by default, the framework decides how many reduce tasks to run). It is possible to force the number of reducers:
hadoop jar $STREAMINGJAR -D mapred.reduce.tasks=2 \
-files wc_mapper.py,wc_reducer.py \
-input input/dracula -output sortie \
-mapper wc_mapper.py -reducer wc_reducer.py
The command:
hadoop fs -ls sortie/
will now give:
Found 3 items
-rw-r--r-- 2 root supergroup 0 2025-01-25 08:22 sortie/_SUCCESS
-rw-r--r-- 2 root supergroup 117444 2025-01-25 08:22 sortie/part-00000
-rw-r--r-- 2 root supergroup 118967 2025-01-25 08:22 sortie/part-00001
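The two part files hold disjoint sets of words because each key is assigned to exactly one reducer by a partitioner. A sketch of the idea (Hadoop's default HashPartitioner uses the key's Java hashCode(); the character-sum below is just an illustrative stand-in):

```python
def partition(word, num_reducers):
    # Stand-in for Hadoop's HashPartitioner: assign each key to one
    # reducer deterministically, so every word lands in exactly one
    # part-0000x file. (Hadoop actually uses Java's hashCode().)
    return sum(ord(c) for c in word) % num_reducers

# Distribute a few sample words across 2 reducers.
buckets = {r: [] for r in range(2)}
for w in ["dracula", "the", "count", "castle"]:
    buckets[partition(w, 2)].append(w)
```

Because the assignment is deterministic, all occurrences of a given word reach the same reducer, which is what makes per-word counting correct.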
Monitoring the Cluster and Jobs

Hadoop offers several web interfaces for observing the behavior of its different components. You can view these pages locally on your machine thanks to the -p option of the docker run command we used earlier.
- Port 9870 allows you to view information about your Namenode.
- Port 8088 allows you to view information about the Resource Manager (called Yarn) and monitor the behavior of various jobs.
Once your cluster is up and running, use your preferred browser to navigate to http://localhost:9870, which shows the Namenode overview page. Note: During installation, some students had to remove the port mapping, so they won't be able to view this page.
Take the time to browse through the menus and check the information displayed.
You can also monitor the progress and results of your jobs (MapReduce or others) by going to http://localhost:8088. Again, take the time to explore the menus and observe the information presented.
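Besides the browser, the ResourceManager exposes a REST API on the same port; the /ws/v1/cluster/apps endpoint is part of YARN's standard REST API. A sketch of querying it from Python (assuming the default localhost:8088 port mapping; list_apps is a hypothetical helper):

```python
import json
from urllib.request import urlopen

def apps_url(host="localhost", port=8088, states="FINISHED"):
    # YARN ResourceManager REST endpoint listing applications,
    # filtered by state (e.g. RUNNING, FINISHED).
    return f"http://{host}:{port}/ws/v1/cluster/apps?states={states}"

def list_apps(url):
    """Return (id, name, state) for each application reported by YARN."""
    with urlopen(url) as resp:
        data = json.load(resp)
    apps = (data.get("apps") or {}).get("app") or []
    return [(a["id"], a["name"], a["state"]) for a in apps]
```

For example, list_apps(apps_url()) should report your wordcount job once it has finished (this requires the cluster to be running).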