Content for Chapter 2, Big Data Technologies
Here is the detailed content of the class sessions for Chapter 2 of the module Big Data Computing Technologies (4th semester), BSc in Data Science for Responsible Business (Centrale Lyon & EM Lyon).
Caution: the content of this page will evolve as the lessons progress.
Part 1. Linked Open Data (LOD) technology (6h) and project (7h).
Teaching materials
- Slides
- IPython notebook with examples from the course (and other examples).
Educational resources
Book (available at the Centrale Lyon library)
- Learning SPARQL, by Bob DuCharme, 2nd Edition, 2013, O’Reilly. (PDF copies can be found on the Internet!)
SPARQL language reference
- SPARQL language from W3C.
- SPARQL By Example: The Cheat Sheet.
- SPARQL query validator. As a bonus, it re-indents your code and improves its readability!
- Big Data in 5 minutes
- What is Linked Open Data? (Introduction for students)
- What is Linked Data? (A short non-technical introduction to Linked Data)
- SPARQL in 11 minutes
Practical work
LOD Project (in groups of 3 students)
Part 2. Hadoop framework, including HDFS and the mrjob Python library (8h).
Preparation (note that installing the software and the container will require up to 3 GB of free space on your hard drive!)
- Install Docker on your personal machine by following this link.
- Launch Docker (it will run in the background).
- Open a terminal (Windows PowerShell for Windows users) and execute the following command:
docker pull stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
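If you prefer to script the preparation step, the command above can be wrapped in a few lines of Python. This is only a minimal sketch: the helper names (`docker_available`, `build_pull_command`, `pull_image`) are hypothetical and not part of the course material, and it assumes Docker is already installed and running. The image name is copied from the command above.

```python
import shutil
import subprocess

# Image name copied verbatim from the `docker pull` command above.
IMAGE = "stephanederrode/docker-cluster-hadoop-spark-python-16:3.6"


def docker_available() -> bool:
    """Return True if a `docker` executable is found on the PATH."""
    return shutil.which("docker") is not None


def build_pull_command(image: str) -> list:
    """Build the argument list equivalent to `docker pull <image>`."""
    return ["docker", "pull", image]


def pull_image(image: str = IMAGE) -> int:
    """Run `docker pull` and return its exit code (hypothetical helper)."""
    if not docker_available():
        raise RuntimeError("Docker not found on PATH; install and launch it first.")
    # check=False: the caller inspects the exit code instead of catching an exception.
    return subprocess.run(build_pull_command(image), check=False).returncode


# Show the command that would be executed, without running it.
print(" ".join(build_pull_command(IMAGE)))
```

Calling `pull_image()` downloads the image; a return value of 0 means the pull succeeded.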
Teaching materials
Educational resources
- To be done!
- Hadoop In 5 Minutes
- What Is Hadoop? A 30-minute introduction for beginners.
- HDFS Tutorial For Beginners (43 minutes).
Practical work
Part 3. Spark framework, using the PySpark Python library (4h).
Teaching materials
Educational resources
- PySpark Cookbook, by Denny Lee and Tomasz Drabas, 1st Edition, 2018, Packt Publishing.
- Taming Big Data with Apache Spark and Python, by Frank Kane, 1st Edition, 2017, Packt Publishing.
- Big Data Analysis with Python: Combine Spark and Python to unlock the powers of parallel computing and machine learning, by Ivan Marin, Ankit Shukla and Sarang VK, 1st Edition, 2019, Packt Publishing.
Practical work