Lab on the MRJob Library
Two exercises are proposed. The work should be done locally, on your personal computer, using the MRJob library.
Exercise 1 - Querying a Sales File
We aim to gather information and calculate statistics on sales results stored in a large file.
Create a directory on your hard drive and place in it the sales files you’ll be working on:
- purchases.txt: >4,000,000 lines!
- purchases_10000.txt: 10,000-line extraction.
The file is organized into 6 columns:
- Date (format YYYY-MM-DD)
- Time (format hh:mm)
- Purchase city
- Purchase category (e.g., Book, Men’s Clothing, DVDs…)
- Amount spent by the customer
- Payment method (e.g., Amex, Cash, MasterCard…)
The columns are separated by a tab character, written \t in Python. Example: print("before\tafter") will print “before” and “after” separated by a tab.
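As a sanity check before writing any MapReduce code, a record can be split into its 6 fields with Python’s str.split; the line content below is made-up sample data, not taken from the actual file:

```python
# Split one tab-separated purchase record into its 6 fields.
# The values here are illustrative sample data.
line = "2012-01-01\t09:00\tSan Jose\tBook\t12.99\tAmex"
date, time, city, category, amount, payment = line.split("\t")

print(category)       # -> Book (the 4th column)
print(float(amount))  # amounts can be converted to numbers for summing
```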
Now that you’re equipped, you can develop map-reduce scripts using the MRJob library. Below is a list of questions to address in separate .py files, so that you keep a record of your work.
- What is the number of purchases made for each purchase category?
- What is the total amount spent for each purchase category?
- How much is spent in the city of San Francisco for each payment method?
- In which city did the Women’s Clothing category generate the most money using Cash?
- Add an original and complex query of your choice on this file (i.e., one requiring more than one map-reduce step).
Make sure to test your programs in -r local mode to verify that your algorithms behave correctly under parallel processing.
Exercise 2 - Lexical Particularity
Given a file of words, write an MRJob script that detects the longest words containing only one vowel (from aeiouy), possibly repeated multiple times. For example, in a French dictionary, the word abracadabrant is the longest word (13 letters) containing only the vowel a (in 5 instances).
The output should display such a word for each of the 6 vowels. The algorithm should ignore case, i.e., treat uppercase and lowercase letters as identical.
Word Dictionaries
For your intensive testing, you can use the following (English-language) file: words_alpha.