Hadoop MapReduce: big matrix multiplication
Setup
If you wish to copy files from your operating system to the Docker container, here’s how to proceed (the example concerns the files from step #1 of this lab):
- Open a new Terminal (leave the first one open, as we will use it again!), and navigate to the working directory containing the wc_mapper_improved.py and wc_reducer_improved.py scripts that you wrote during the first part. The following commands will copy these two files into the wordcount folder of the container’s Linux file system:
docker cp wc_mapper_improved.py hadoop-master:/root/TP_Hadoop/wordcount/
docker cp wc_reducer_improved.py hadoop-master:/root/TP_Hadoop/wordcount/
Remember this syntax, as it will be useful later to transfer the new Python scripts you develop into the container.
Now, go back to the first Terminal (don’t close the second one, it will be useful later), and check with the ls command that the two files are indeed present. You now need to:
- Make these two scripts runnable (a quick local check is shown after this list):
chmod +x wc_mapper_improved.py
chmod +x wc_reducer_improved.py
- Note (for Windows users only): You also need to convert the line-break characters, as they differ between Windows and Linux. For each text file (e.g., fichier.py) that you transfer from your machine to the Linux account, you should run:
dos2unix fichier.py
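Once the scripts are executable (and converted with dos2unix if needed), you can check them locally before going back to the cluster. A minimal smoke test, assuming wc_mapper_improved.py reads raw text on stdin and wc_reducer_improved.py consumes the mapper’s sorted output (the usual Hadoop Streaming convention):
echo "hello world hello hadoop" | ./wc_mapper_improved.py | sort | ./wc_reducer_improved.py
This pipe mimics what Hadoop Streaming does on the cluster: input → mapper → sort on the key → reducer.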
Exercise
- Navigate to the matrice directory on the Linux account:
cd ~/TP_Hadoop/matrice/
- Run the matrice.py program to generate two matrices and save them to disk (matriceA.txt and matriceB.txt). Both matrices will be displayed, along with the result of their multiplication:
python matrice.py
Then, upload these two files to HDFS in the input directory:
hadoop fs -put matriceA.txt input
hadoop fs -put matriceB.txt input
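Before writing your scripts, it can help to check that both files are indeed in HDFS and to look at their exact line format:
hadoop fs -ls input
hadoop fs -cat input/matriceA.txt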
Question: Write a mapper and a reducer script to perform the multiplication of these two matrices, using the matriceA.txt and matriceB.txt files. You can refer to the following link.
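A minimal sketch of one possible approach is given below; treat it as a starting point, not as the expected solution. It assumes that every line of matriceA.txt and matriceB.txt has the form "A i j value" (matrix name, row index, column index, value, separated by whitespace); this format is an assumption, not something guaranteed by matrice.py, so adapt the parsing after inspecting the files. (If the lines carry no matrix name, the mapper can instead look at the input file name that Hadoop Streaming exposes in the mapreduce_map_input_file environment variable.) The idea: for each element A[i,k] and B[k,j], the mapper emits the shared index k as the key; the reducer then joins the two matrices on k and accumulates the partial products A[i,k]*B[k,j] into C[i][j].
#!/usr/bin/env python
# multmat_mapper.py -- sketch; assumed input lines: "A i j value" or "B i j value"
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) != 4:
        continue  # skip blank or malformed lines
    name, row, col, val = parts
    if name == "A":
        # A[i,k]: the join key is the column index k
        print("{0}\tA\t{1}\t{2}".format(col, row, val))
    elif name == "B":
        # B[k,j]: the join key is the row index k
        print("{0}\tB\t{1}\t{2}".format(row, col, val))

#!/usr/bin/env python
# multmat_reducer.py -- sketch; joins A and B on the shared index k and
# accumulates the partial products A[i,k]*B[k,j] into C[i][j]
import sys
from collections import defaultdict

C = defaultdict(float)   # (i, j) -> running sum
current_key = None
a_entries = []           # (i, A[i,k]) for the current key k
b_entries = []           # (j, B[k,j]) for the current key k

def flush():
    # combine every A entry with every B entry sharing the current key k
    for i, a in a_entries:
        for j, b in b_entries:
            C[(i, j)] += a * b

for line in sys.stdin:
    key, name, idx, val = line.rstrip("\n").split("\t")
    if key != current_key:
        flush()
        current_key = key
        a_entries, b_entries = [], []
    if name == "A":
        a_entries.append((int(idx), float(val)))
    else:
        b_entries.append((int(idx), float(val)))

flush()  # do not forget the last key
for i, j in sorted(C):
    print("{0}\t{1}\t{2}".format(i, j, C[(i, j)]))

Because the reducer keeps the whole result matrix C in memory and only prints it once all keys have been read, this sketch relies on the job using a single reducer (Hadoop’s default), which is fine for matrices of this size. You can dry-run the pair locally before submitting:
cat matriceA.txt matriceB.txt | ./multmat_mapper.py | sort | ./multmat_reducer.py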
Important note: It’s difficult to edit text files directly on the Linux account since we don’t have access to a graphical interface (you can always use the nano editor, e.g., nano script.py, but it’s not very user-friendly). The solution is to write the scripts on your operating system (using tools you’re familiar with, like Spyder or VSCode) and transfer the files with the docker cp command (as shown above), e.g.:
docker cp multmat_mapper.py hadoop-master:/root/TP_Hadoop/matrice/
docker cp multmat_reducer.py hadoop-master:/root/TP_Hadoop/matrice/
- To run your job on the Hadoop cluster, adapt the command we saw earlier:
hadoop jar $STREAMINGJAR -files multmat_mapper.py,multmat_reducer.py \
-mapper multmat_mapper.py -reducer multmat_reducer.py \
-input input/matriceA.txt,input/matriceB.txt -output sortie
Check that the result obtained by your algorithm is the same as the one displayed by the matrice.py program.
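To display the job’s output stored in HDFS and compare it with what matrice.py printed:
hadoop fs -cat sortie/part-*
If you need to re-run the job, first delete the output directory (hadoop fs -rm -r sortie), since Hadoop refuses to overwrite an existing output directory.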