Installation of Hadoop via Docker
The steps for installing Hadoop via Docker are largely adapted from the page by Lilia Sfaxi, which itself is based on the GitHub project by Kai Liu.
Installation of Docker
To install the Docker software, please follow the instructions available here, depending on your operating system (check the System requirements to ensure that your machine is compatible). If your machine is too old or has limited disk space or RAM, there is a good chance the installation will not work. If that is the case:
- either work with a neighbor,
- or go directly to the second part of the lab and do the exercises locally (without Hadoop).
From now on, and for the rest of the labs, you will need to remember to start Docker (which will run in the background).
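One quick way to check that the Docker daemon is indeed running is to query it from a terminal (an optional sanity check, not part of the official installation instructions):
docker info
If the daemon is not running, this command fails with a message along the lines of "Cannot connect to the Docker daemon".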
Setup of nodes
Throughout this lab, we will use three containers: one master node (the Namenode) and two slave nodes (the Datanodes).
- From a Terminal, download the Docker image stored on Docker Hub (the image size is > 3.3 GB!):
docker pull stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
This image contains a Linux/Ubuntu distribution, along with the libraries needed to use Hadoop and Spark. It also includes a python3.x distribution.
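To verify that the download completed successfully, you can list the images present on your machine (an optional check); the image above should appear in the output:
docker images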
- Create the three containers from the downloaded image. To do this:
a. Create a network to connect the three containers:
docker network create --driver=bridge hadoop
b. Create and launch the three containers (the -p flags map the host machine's ports to the container's ports):
docker run -itd --net=hadoop -p 9870:9870 -p 8088:8088 -p 7077:7077 -p 16010:16010 -p 9999:9999 --name hadoop-master --hostname hadoop-master stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
docker run -itd -p 8040:8042 --net=hadoop --name hadoop-slave1 --hostname hadoop-slave1 stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
docker run -itd -p 8041:8042 --net=hadoop --name hadoop-slave2 --hostname hadoop-slave2 stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
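You can then check that the network and the containers were created correctly (an optional verification):
docker network inspect hadoop
docker ps
The first command lists, among other details, the three containers attached to the hadoop network; the second lists the running containers, which should include hadoop-master, hadoop-slave1 and hadoop-slave2.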
Notes:
- On some machines, the first command (the one creating hadoop-master) may not execute correctly. The error is most likely due to port 9870 already being in use by another application (see the check suggested after these notes). If this is the case, you can remove this port mapping from the command:
docker run -itd --net=hadoop -p 8088:8088 -p 7077:7077 -p 16010:16010 -p 9999:9999 --name hadoop-master --hostname hadoop-master stephanederrode/docker-cluster-hadoop-spark-python-16:3.6
- Port 9999 will be used during the lab on the Spark streaming library.
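To find out which application is occupying port 9870, one possibility on macOS or Linux is the lsof utility (on Windows, netstat -ano gives similar information):
lsof -i :9870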
Enter the Namenode
- Enter the hadoop-master container to start using it:
docker exec -it hadoop-master bash
The result of this execution should be:
root@hadoop-master:~#
This is the shell, or bash (Linux/Ubuntu), of the master node.
- The ls command, which lists the files and directories in the current directory, should display the following directories and files (among others):
TP_Hadoop TP_Spark hdfs
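To leave the container's shell and return to your own machine's terminal, simply type:
exit
The containers keep running in the background, so you can re-enter the Namenode at any time with the docker exec command above.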
Note: These configuration steps need to be done only once. To restart the cluster (after, for example, shutting down and restarting your computer), simply:
- Start the Docker Desktop application, which launches the Docker daemons.
- Run the following command:
docker start hadoop-master hadoop-slave1 hadoop-slave2
You can then enter the Namenode:
docker exec -it hadoop-master bash
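Conversely, to stop the cluster cleanly (before shutting down your computer, for instance), one option is to stop the three containers:
docker stop hadoop-master hadoop-slave1 hadoop-slave2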