Testing MrJob on the Hadoop Cluster

We will now run the word count algorithm again on the Hadoop cluster.

Restart the Cluster

You first need to restart the cluster installed during the previous lab, with its NameNode and two DataNodes. Start by launching the Docker Desktop application (to start the Docker daemon). Then, in a terminal, type:

docker start hadoop-master hadoop-slave1 hadoop-slave2

Next, open a bash shell inside the NameNode container:

docker exec -it hadoop-master bash

Finally, verify that HDFS is up and reachable by listing your home directory:

hadoop fs -ls

Delete the job's output directory (sortie) on HDFS:

hadoop fs -rm -r -f sortie

Remember to re-run this last command between executions: Hadoop refuses to start a job whose output directory already exists.

Running an Algorithm on the Cluster

  • Navigate to the directory:
cd ~/TP_Hadoop/wordcount
  • Run the job on the Hadoop cluster:
python wc_mrjob_1.py -r hadoop < dracula > resultHadoop.txt

This command runs the job on the Hadoop cluster, but it reads its input (here, dracula) from the local filesystem (the copy in your current working directory), not from HDFS.
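To recall what the job actually computes, here is a minimal pure-Python simulation of the map, shuffle, and reduce steps performed by a MrJob word count. This is an illustrative sketch only: the function names (mapper, reducer, run_wordcount) and the tokenization regex are assumptions, and the real wc_mrjob_1.py uses the mrjob library rather than this hand-rolled driver.

```python
import re
from collections import defaultdict

# Assumed tokenizer: words are runs of word characters or apostrophes.
WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    # Map step: emit a (word, 1) pair for every word in the line.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce step: sum all the counts gathered for one word.
    yield word, sum(counts)

def run_wordcount(lines):
    # Local stand-in for Hadoop's shuffle phase: group values by key,
    # then hand each group to the reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for word, counts in groups.items():
        for key, total in reducer(word, counts):
            results[key] = total
    return results

if __name__ == "__main__":
    print(run_wordcount(["the count rises", "the bats fly", "the count sleeps"]))
```

On the cluster, Hadoop performs the grouping done here by run_wordcount, distributing the map and reduce calls across the DataNodes.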

To process the copy of dracula stored in the input directory on HDFS instead, pass its HDFS URI:

python wc_mrjob_1.py -r hadoop hdfs:///user/root/input/dracula > resultHadoop.txt

Note: MrJob can also run the same job on Amazon EMR (with -r emr) or Google Cloud Dataproc (with -r dataproc), given the appropriate cloud credentials.