Lab: Practice map-reduce
Author
- Stéphane Derrode & Lamia Derrode, Centrale Lyon, Dpt Mathématiques & Informatique
Objectives
This lab follows the course on Hadoop, the open-source framework developed and maintained by the Apache Software Foundation. The objective is to practice the map-reduce approach to designing scalable data algorithms with Hadoop.
Python map and reduce functions, exercises
Exercise 1 - Compute the weighted average
You are given a list of grades with weights:
grades = [
(14, 2),
(10, 1),
(16, 3)
]
Each tuple represents (grade, weight).
- Use map to compute the products grade × weight
- Use reduce to compute:
  - the weighted sum
  - the sum of the weights
Compute the final weighted average. Expected result: 86 / 6 ≈ 14.3333.
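One possible way to approach this exercise is sketched below. It is only an illustration, not the reference solution: the variable names (products, weighted_sum, weight_sum, average) are assumptions introduced here. For the grades above, the products are 28, 10 and 48, so the weighted sum is 86 and the weights add up to 6.

```python
from functools import reduce

grades = [(14, 2), (10, 1), (16, 3)]  # (grade, weight)

# map: one product grade * weight per tuple
products = list(map(lambda gw: gw[0] * gw[1], grades))

# reduce: fold the products, then the weights, into single totals
weighted_sum = reduce(lambda acc, p: acc + p, products, 0)
weight_sum = reduce(lambda acc, gw: acc + gw[1], grades, 0)

average = weighted_sum / weight_sum
print(round(average, 4))
```

Passing 0 as the initial value of reduce keeps the code valid even for an empty list of products (the final division would still need a guard in that case).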
Exercise 2 - Compute total duration of events of type “login” only
We consider a list of events described by tuples:
events = [
("Alice", "login", 3),
("Bob", "login", 1),
("Alice", "logout", 2),
("Bob", "logout", 4),
("Alice", "login", 1),
("Bob", "login", 2),
]
Each event has the form: (user, event_type, duration).
Objective
Using only the functions map and functools.reduce (and without any for loop, list comprehension, or defaultdict), compute the following dictionary:
{
"Alice": 4,
"Bob": 3
}
The value associated with each user corresponds to the total duration of events of type “login” only.
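A minimal sketch of one approach that respects the constraints (only map and functools.reduce, no for loop, no comprehension, no defaultdict). The helper add_pair and the names login_pairs and login_totals are hypothetical, chosen for this illustration. The idea is to map every event to a (user, duration) pair in which non-login events contribute 0, then to reduce the pairs into a dictionary.

```python
from functools import reduce

events = [
    ("Alice", "login", 3),
    ("Bob", "login", 1),
    ("Alice", "logout", 2),
    ("Bob", "logout", 4),
    ("Alice", "login", 1),
    ("Bob", "login", 2),
]

# map: keep the user; count the duration only for "login" events
login_pairs = map(lambda e: (e[0], e[2] if e[1] == "login" else 0), events)

def add_pair(totals, pair):
    """Return a new dictionary with the pair's duration added to its user."""
    user, duration = pair
    return {**totals, user: totals.get(user, 0) + duration}

# reduce: merge every (user, duration) pair into one dictionary
login_totals = reduce(add_pair, login_pairs, {})
print(login_totals)
```

With this design, a user who has no login event at all would still appear in the result, with a total of 0.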
Exercise 3 - Compute the total duration in seconds
We consider a list of durations expressed as strings:
times = [
"01:15:30",
"00:45:15",
"02:00:00",
"00:30:45"
]
Each duration follows the format: HH:MM:SS.
Questions
- Using map, convert each duration into a total number of seconds
- Using reduce, compute the total duration in seconds
- (Bonus) convert this total duration back into the HH:MM:SS format
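The three questions can be sketched as follows. This is one possible approach, not the reference solution: the helper to_seconds and the variable names are assumptions made for this example. The bonus step relies on divmod to split the total into hours, minutes and seconds.

```python
from functools import reduce

times = ["01:15:30", "00:45:15", "02:00:00", "00:30:45"]

def to_seconds(hms):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = map(int, hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

# map: one number of seconds per duration
seconds_list = list(map(to_seconds, times))

# reduce: add all the durations together
total_seconds = reduce(lambda acc, s: acc + s, seconds_list, 0)

# bonus: back to HH:MM:SS
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
total_hms = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
print(total_seconds, total_hms)
```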
Hadoop map-reduce, stand-alone mode
We will now run the map-reduce algorithm that counts the words in a text file locally, i.e., without yet using the parallelism offered by the Hadoop framework. The program consists of two Python scripts that are chained one after the other, as described below. The focus here is mainly on understanding the algorithmic logic.
Note: You should know how to open a Terminal on your machine, regardless of the operating system. On Windows 10 (and later versions), you can use the PowerShell program (installed by default), which is very similar to the Terminal on Linux and Mac OS X. There are video tutorials available to learn the basic commands, including this one. Finally, note that PowerShell includes git, wget, and ssh: three tools we will be using during the labs.
Setup
To retrieve the scripts, create a working directory and download the two Python scripts as well as the Dracula book in text format into this directory:
Execution of the Python scripts
Notes for Windows users: If you are using Windows, replace python with python.exe. Be sure to make this change everywhere in the instructions. If the command python.exe doesn’t work, it means the location of the python.exe program is not known to your machine, so you need to specify it by modifying the PATH environment variable. You can follow the instructions given in Section Method 2: Manually add Python to Windows Path of this link.
- Open a Terminal and navigate to your working directory (using the cd command).
- Run the following command and observe the result:
cat dracula | python wc_mapper.py
- Then run the next command and observe the result:
cat dracula | python wc_mapper.py | sort
- Finally run the full command and observe the result:
cat dracula | python wc_mapper.py | sort | python wc_reducer.py
It is possible to redirect the output to a file, rather than to the screen:
cat dracula | python wc_mapper.py | sort | python wc_reducer.py > result.txt
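To make the logic of the mapper | sort | reducer pipeline concrete, here is a small in-process simulation. It is only a sketch under assumptions: the functions mapper and reducer below are hypothetical stand-ins, not the contents of wc_mapper.py and wc_reducer.py, and the tab-separated "word, 1" record format is assumed for illustration. The key point is that sorting places identical keys next to each other, so the reducer only has to sum consecutive records.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

# Two hypothetical input lines standing in for the text file
sample = ["the count said", "the night is the time"]

# sorted() plays the role of the shell 'sort' between the two scripts
counts = dict(reducer(sorted(mapper(sample))))
print(counts)
```

Without the sort step, groupby would see the three occurrences of "the" as separate groups, which is why the sort sits between the two scripts in the command line above.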
Important note: The first line of all your Python scripts must be
#!/usr/bin/env python3
This line indicates that when the script is executed directly, it is run with python3. To check that this line is present in the scripts you downloaded, type:
more wc_mapper.py
Press the q key to quit the more command.
Improving the Scripts
By carefully looking at the output of the previous command, we notice two “issues”:
- The case of the words is taken into account. For example, the words “Youth” and “youth” are considered as two different words.
- Punctuation marks are also taken into account. For example, the words “Youth”, “Youth,” and “Youth.” are considered as three different words.
Copy the two scripts into two new files:
cp wc_mapper.py wc_mapper_improved.py
cp wc_reducer.py wc_reducer_improved.py
and modify these new files so that the word counting is no longer sensitive to these two issues.
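As a starting point, the normalization step could look like the sketch below. The helpers normalize and words are hypothetical names, and where exactly to call them in your improved scripts is left to you. The sketch lowercases each word and removes the ASCII punctuation characters listed in string.punctuation.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(word):
    """Lowercase a word and strip ASCII punctuation from it."""
    return word.lower().translate(_PUNCT_TABLE)

def words(line):
    """Split a line into normalized words, dropping empty results."""
    return [w for w in map(normalize, line.split()) if w]

print(words("Youth, youth. -- Youth"))
```

Two limitations are worth keeping in mind: string.punctuation covers ASCII characters only, so typographic quotes or dashes in the text would remain, and apostrophes inside words are removed as well ("don't" becomes "dont").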