Lab on the MRJob Library

This lab consists of two exercises. The work should be done locally, on your personal computer, using the MRJob library.

Exercise 1 - Querying a Sales File

We aim to gather information and calculate statistics on sales results stored in a large file.

Create a directory on your hard drive and place in it the sales file you’ll be working on:

The file is organized into 6 columns:

  • Date (format YYYY-MM-DD)
  • Time (format hh:mm)
  • Purchase city
  • Purchase category (e.g., Book, Men’s Clothing, DVDs…)
  • Amount spent by the customer
  • Payment method (e.g., Amex, Cash, MasterCard…)

The columns are separated by a tab character. This character is represented by \t in Python. Example: print("before\tafter") will print the string “before     after”.
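Since each record is one tab-separated line, the first thing every mapper will do is split the line into its six fields. A minimal sketch (the sample record and the variable names are only illustrative, not imposed by the lab):

```python
# One tab-separated sales record (made-up example data).
line = "2012-01-01\t09:00\tSan Jose\tBook\t12.99\tVisa"

# Split into the six columns described above.
date, time, city, category, amount, payment = line.split("\t")

# The amount arrives as a string; convert it before summing.
amount = float(amount)
print(city, category, amount)  # → San Jose Book 12.99
```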

Now that you’re equipped, you can develop map-reduce scripts using the MRJob library. Below is a list of questions that you will address in different .py files to keep track of your work.

  1. What is the number of purchases made for each purchase category?
  2. What is the total amount spent for each purchase category?
  3. How much is spent in the city of San Francisco for each payment method?
  4. In which city did the Women’s Clothing category generate the most money using Cash?
  5. Add an original, more complex query of your choice on this file (i.e., one whose job requires more than one step).
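For question 1, the map-reduce logic can be prototyped with plain Python generators before wrapping it in an MRJob subclass. This is only a sketch: the in-memory `run` helper below stands in for the grouping (shuffle) that the framework normally performs for you, and the sample lines are invented:

```python
from collections import defaultdict

def mapper(line):
    # Emit (category, 1) for each sales record.
    fields = line.split("\t")
    yield fields[3], 1

def reducer(key, values):
    # Sum the counts collected for one category.
    yield key, sum(values)

def run(lines):
    # Stand-in for the framework's shuffle: group emitted values by key,
    # then apply the reducer to each group.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(kv for key in sorted(groups) for kv in reducer(key, groups[key]))

sales = [
    "2012-01-01\t09:00\tSan Jose\tBook\t12.99\tVisa",
    "2012-01-01\t09:05\tFort Worth\tBook\t8.50\tCash",
    "2012-01-01\t09:10\tSan Jose\tDVDs\t19.99\tAmex",
]
print(run(sales))  # → {'Book': 2, 'DVDs': 1}
```

In the actual job, these two generators become the `mapper(self, _, line)` and `reducer(self, key, values)` methods of a class deriving from `mrjob.job.MRJob`, and MRJob takes care of the grouping between them.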

Make sure to test your programs with the -r local runner to verify that your algorithms still behave correctly when the work is split across several mappers and reducers.

Exercise 2 - Lexical Particularity

Given a file of words, write an MRJob script that detects the longest words containing only one distinct vowel (from a, e, i, o, u, y), possibly repeated several times. For example, in a French dictionary, the word abracadabrant is the longest word (13 letters) containing only the vowel a (which appears 5 times).

The output should display such a word (or words) for each of the 6 vowels. Your algorithm should be case-insensitive: uppercase letters in the words must be ignored.
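The heart of this exercise is deciding, for a given word, whether it uses exactly one distinct vowel. One possible sketch of that check (the helper name `only_vowel` is our own choice, not required by the lab):

```python
VOWELS = set("aeiouy")

def only_vowel(word):
    """Return the word's unique vowel if it contains exactly one
    distinct vowel (case-insensitive), otherwise None."""
    found = VOWELS & set(word.lower())
    return found.pop() if len(found) == 1 else None

print(only_vowel("abracadabrant"))  # → a
print(only_vowel("Purchase"))       # → None (contains u, a, e)
```

A natural mapper then emits a (vowel, word) pair for every word passing this test, and the reducer keeps the longest word(s) seen for each vowel.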

Word Dictionaries

For your intensive testing, you can use the following (English-language) file: words_alpha.