Test MrJob Hadoop
Testing MrJob on the Hadoop Cluster
We will now run the word count algorithm again on the Hadoop cluster.
Restart the Cluster
You first need to restart the cluster installed during the previous lab, with its NameNode and two DataNodes. Launch the Docker Desktop application (to start the Docker daemon), then, in a terminal, type:
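The exact commands depend on how the cluster was built in the previous lab. A sketch, assuming the cluster was defined with Docker Compose in a directory named `hadoop-cluster` (both the directory name and the setup are assumptions):

```shell
# Move to the directory containing the previous lab's docker-compose.yml
# (hypothetical path -- use the directory from your own setup).
cd hadoop-cluster

# Restart the stopped containers (NameNode + two DataNodes)
# without recreating them.
docker-compose start

# Verify that the three nodes are running.
docker ps
```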
Next, enter the NameNode’s bash:
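Assuming the NameNode container is named `namenode` (adjust to the name shown by `docker ps`):

```shell
# Open an interactive bash session inside the NameNode container.
docker exec -it namenode bash
```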
Finally, verify that HDFS is mounted properly by running:
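A quick check, run from inside the NameNode container, is to list the HDFS root and ask for a cluster report; a healthy cluster should show two live DataNodes:

```shell
# List the HDFS root directory.
hdfs dfs -ls /

# Print cluster health; look for "Live datanodes (2)".
hdfs dfsadmin -report
```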
Delete the output directory on HDFS:
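A sketch, assuming the job writes to a directory named `output` under your HDFS home (the exact path depends on your setup from the previous lab):

```shell
# Remove the output directory recursively. Hadoop refuses to write
# into an existing output directory, hence this cleanup step.
hdfs dfs -rm -r output
```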
Remember to run this last command between executions.
Running an algorithm on the cluster
- Navigate to the directory:
- Run the job on the Hadoop cluster:
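The two steps above might look like this, assuming the word-count script from the earlier labs is called `word_count.py` and lives in a directory of the same name (both names are assumptions):

```shell
# Step 1: navigate to the directory containing the script and input file.
cd word_count

# Step 2: run the job; "-r hadoop" selects mrjob's Hadoop runner,
# so the job executes on the cluster instead of locally.
python word_count.py -r hadoop dracula
```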
This command runs the job on the Hadoop cluster, but it reads the input file (here, dracula) locally (i.e., from your current working directory), not from HDFS.
Here is the command to access the dracula file stored in the input directory on HDFS:
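With mrjob, an HDFS path is passed using the `hdfs:///` scheme. A sketch, assuming the file was uploaded to an `input` directory under `/user/root` in the previous lab (the path is an assumption):

```shell
# Read the input from HDFS instead of the local filesystem.
python word_count.py -r hadoop hdfs:///user/root/input/dracula
```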
Note: It is also possible to run a job on EMR (Amazon Web Services) or Dataproc (Google Cloud).
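For reference, switching cloud services is a matter of choosing a different mrjob runner; the commands below are sketches only, since they also require credentials and project configuration (not shown), and the bucket names are hypothetical:

```shell
# Run on Amazon EMR, reading input from S3 (hypothetical bucket).
python word_count.py -r emr s3://my-bucket/dracula

# Run on Google Cloud Dataproc, reading input from GCS (hypothetical bucket).
python word_count.py -r dataproc gs://my-bucket/dracula
```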