Lab on the MRJob Library
Two exercises are proposed. The work should be done locally, on your personal computer, using the MRJob library.
Exercise 1 - Querying a Sales File
We aim to gather information and calculate statistics on sales results stored in a large file.
Create a directory on your hard drive and place in it the sales files you’ll be working on:
- purchases.txt: >4,000,000 lines!
- purchases_10000.txt: 10,000-line extraction.
The file is organized into 6 columns:
- Date (format YYYY-MM-DD)
- Time (format hh:mm)
- Purchase city
- Purchase category (e.g., Book, Men’s Clothing, DVDs…)
- Amount spent by the customer
- Payment method (e.g., Amex, Cash, MasterCard…)
The columns are separated by a tab character, written \t in Python. Example: print("before\tafter") will print “before” and “after” separated by a tab.
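As a sanity check before writing any MapReduce code, a record can be split into its 6 fields with Python’s str.split; the line content below is made-up sample data, not taken from the actual file:

```python
# Split one tab-separated purchase record into its 6 fields.
# The values here are illustrative sample data.
line = "2012-01-01\t09:00\tSan Jose\tBook\t12.99\tAmex"
date, time, city, category, amount, payment = line.split("\t")

print(category)       # -> Book (the 4th column)
print(float(amount))  # amounts can be converted to numbers for summing
```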
Now that you’re equipped, you can develop map-reduce scripts using the MRJob library. Below is a list of questions to address in separate .py files, so that you keep a record of your work.
- What is the number of purchases made for each purchase category?
- What is the total amount spent for each purchase category?
- How much is spent in the city of San Francisco for each payment method?
- In which city did the Women’s Clothing category generate the most money using Cash?
- Add an original and complex query of your choice on this file (i.e., one requiring more than one map-reduce step).
Make sure to test your programs in -r local mode to verify that your algorithms behave correctly under parallel processing.
Exercise 2 - Lexical Particularity
Given a file of words, write an MRJob script that detects the longest words containing only one vowel (from aeiouy), possibly repeated multiple times. For example, in a French dictionary, the word abracadabrant is the longest word (13 letters) containing only the vowel a (in 5 instances).
The output should display such a word for each of the 6 vowels. The algorithm should ignore case, i.e., treat uppercase and lowercase letters as identical.
Word Dictionaries
For your intensive testing, you can use the following (English-language) file: words_alpha.