Content for Chapter 2, Big Data Technologies

Here is the detailed content of the class sessions for Chapter 2 of the module Big Data computing technologies (4th semester), BSc in Data Science for Responsible Business (Centrale Lyon & EM Lyon).

Caution: the content of this page will evolve as the lessons progress.

Part 1. Linked Open Data (LOD) technology (6h) and project (7h).

Part 2. Hadoop framework, including HDFS and the mrjob Python library (8h).

  • Preparation (note that installing the software and the container will require up to 3 GB of free space on your hard drive!)

    • Install Docker on your personal machine by following this link.
    • Launch Docker (it will run in the background).
    • Open a terminal (Windows PowerShell for Windows users) and run the following command:
  docker pull stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
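
Once the pull has finished, you may want to confirm that the image is present on your machine. The following is only a minimal sketch, assuming Python 3 is installed and the `docker` command is on your PATH; the helper names `inspect_command` and `image_is_available` are hypothetical and not part of the course material. It relies on `docker image inspect`, which exits with a non-zero status when the image is not found locally.

```python
import shutil
import subprocess

# Image name taken from the pull command above.
IMAGE = "stephanederrode/docker-cluster-hadoop-spark-python-16:3.6"


def inspect_command(image):
    """Build the `docker image inspect` command for the given image."""
    return ["docker", "image", "inspect", image]


def image_is_available(image):
    """Return True if Docker reports that the image exists locally.

    Returns False when the docker CLI is missing, the Docker daemon
    is not running, or the image has not been pulled yet.
    """
    if shutil.which("docker") is None:
        return False
    result = subprocess.run(
        inspect_command(image),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    if image_is_available(IMAGE):
        print(f"{IMAGE}: available locally")
    else:
        print(f"{IMAGE}: not found (is Docker running, and did the pull finish?)")
```

If the script reports that the image is not found, check that Docker is running in the background and run the pull command again.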

Part 3. Spark framework, using the PySpark Python library (4h).

  • Teaching materials

  • Educational resources

    • Books

      • PySpark Cookbook, by Denny Lee and Tomasz Drabas, 1st Edition, 2018, Packt Publishing.
      • Taming Big Data with Apache Spark and Python, by Frank Kane, 1st Edition, 2017, Packt Publishing.
      • Big Data Analysis with Python: Combine Spark and Python to unlock the powers of parallel computing and machine learning, by Ivan Marin, Ankit Shukla and Sarang VK, 1st Edition, 2019, Packt Publishing.
    • Videos

  • Practical work