Content for Chapter 2, Big Data Technologies

This page details the class sessions for Chapter 2 of the module Big Data computing technologies (4th semester), BSc in Data Science for Responsible Business (Centrale Lyon & EM Lyon).

Caution: the content of this page will evolve as the lessons progress.

Part 1. Linked Open Data (LOD) technology (6h) and project (7h).

Teaching materials
- Slides
- IPython notebook with examples from the course (and other examples).
Educational resources
- Book (available at the Centrale Lyon library)
  - Learning SPARQL, by Bob DuCharme, 2nd Edition, 2013, O’Reilly. (PDF copies can be found on the Internet!)
- SPARQL language reference
  - SPARQL language from W3C.
  - SPARQL By Example: The Cheat Sheet.
  - SPARQL query validator. As a bonus, it re-indents your code and improves its readability!
- Videos
  - Big Data in 5 minutes
  - What is Linked Open Data? (Introduction for students)
  - What is Linked Data? (A short non-technical introduction to Linked Data)
  - SPARQL in 11 minutes
Practical work
- LOD with SPARQL technology.
LOD Project (in groups of 3 students)
- Project expectations description.
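
As a rough illustration of the idea behind the SPARQL practical work, the sketch below matches a single triple pattern against a tiny in-memory set of RDF-like triples in plain Python. The data, the `match` helper, and the `ex:` names are hypothetical and are not taken from the course materials; a real query would be sent to a SPARQL endpoint rather than evaluated this way.

```python
# Toy illustration of a SPARQL triple pattern, e.g.
#   SELECT ?city WHERE { ?city ex:locatedIn ex:France }
# All triples and names below are made up for this example.

TRIPLES = {
    ("ex:Lyon", "ex:locatedIn", "ex:France"),
    ("ex:Paris", "ex:locatedIn", "ex:France"),
    ("ex:Turin", "ex:locatedIn", "ex:Italy"),
}


def match(pattern, triples):
    """Return one {variable: value} binding per triple matching the pattern.

    Terms starting with '?' are variables; other terms must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No break occurred: every term of the pattern matched this triple.
            results.append(binding)
    return results


cities = sorted(
    b["?city"] for b in match(("?city", "ex:locatedIn", "ex:France"), TRIPLES)
)
print(cities)
```

Each `?`-prefixed term plays the role of a SPARQL variable, and the returned dictionaries correspond to the rows of a result set.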

Part 2. Hadoop framework, including HDFS and the MrJob Python library (8h).

Preparation (note that installing the software and the container requires up to 3 GB of free space on your hard drive!)
- Install Docker on your personal machine by following this link.
- Launch Docker (it will run in the background).
- Open a terminal (Windows PowerShell for Windows users) and run the following command:

  docker pull stephanederrode/docker-cluster-hadoop-spark-python-16:3.6

Teaching materials
- Slides
Educational resources
- Videos
  - Hadoop In 5 Minutes
  - What Is Hadoop? A 30-minute introduction for beginners
  - HDFS Tutorial For Beginners (43 minutes)
Practical work
- Practice with the Hadoop framework and HDFS.
- Hadoop map-reduce with the MrJob library.
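
The map-reduce pattern practised in this session can be sketched in plain Python as a word count. This is only an illustration under stated assumptions: it does not use the MrJob API or Hadoop, and the `mapper`, `reducer`, and `run_job` names are hypothetical. It simulates on one machine the map, shuffle, and reduce phases that the framework would distribute across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce phase: sum all counts collected for one word."""
    return word, sum(counts)


def run_job(lines):
    """Simulate map, shuffle (group by key), and reduce on a single machine."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_job(["big data", "Big clusters process big data"])
print(word_counts)
```

The shuffle step is written out explicitly here; in a real Hadoop job the framework performs this grouping between the map and reduce phases.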