If you have ever picked up a book about Hadoop, I am sure you spent a good deal of time setting up your own installation of Hadoop. It is not that the installation instructions in the book are lacking, but rather that when you have a background in Microsoft technology, all the ins and outs of setting things up on Linux can be quite strange and confusing at times.

Take, for instance, the installation of Oracle Java on Ubuntu Linux. Ubuntu comes with OpenJDK installed, and you need to use Oracle Java when you want to run Hadoop. If you go the manual installation route, it might seem very complicated compared to the installer that you would normally run on your Windows system of choice. Unless you are familiar with all the resources made available by the community for getting things like this done easily, you might spend a considerable amount of time googling how to install the prerequisites and only a small amount of time actually installing Hadoop and getting it up and running.

So in the spirit of saving time and helping make the process a little easier, I have decided to create this step-by-step guide on how to install Hadoop in pseudo-distributed mode on a single Ubuntu 12.04 machine. The idea is to help you install all the prerequisites easily and get you to a working Hadoop installation as fast as possible. Since this guide is for installing Hadoop in pseudo-distributed mode, it is not aimed at setting up a Hadoop cluster in production. It is simply aimed at getting an instance of Hadoop up and running for you to experiment with.

So let's get started by logging into your Ubuntu machine with the user account that you will use to run Hadoop. (If you are setting up Hadoop to experiment with it and learn, then this can be your own user account.)

Download Hadoop

Navigate to https://hadoop.apache.org/ and click on the "Download" link that you will see on the landing page.

Once you have clicked on "Download" you will be taken to the Download section of this page and you will need to click on "Releases".

On the "Releases" page click on "Download".

This should take you to the download section on that page. Note the different versions of Hadoop that are available and note the "current stable version". This is the version we will be downloading. (At the time of writing, the current stable version was 1.2.1.) Click on "Download a release now!".

Choose the mirror site from which you want to download the files.

Select the "stable/" folder on the page displayed.

Inside the "stable" folder download the binary tarball and save it to disk. This is the ".bin.tar.gz" file and in our case the filename is "hadoop-1.2.1-bin.tar.gz". (Do not download the ".deb" file as it will give you warnings with the Ubuntu package manager and will not install properly)

The Hadoop binary tarball should now be downloaded to the "Downloads" folder for the user with which you have logged into the system.

Installing Oracle Java

Installing Oracle Java on your Linux machine can seem like one of the most daunting tasks when you are just starting out in the world of Linux. Anyone who tells me that manually installing Oracle Java on an Ubuntu machine is a simple process needs to get professional help. (Yes, I have said it!)

Fortunately the guys at WEB UPD8 have created a PPA repository that takes the pain out of this process. 

I will use the steps that they provide on their site to install Oracle Java 7. If you are prompted for a password at any time during running these commands provide the password of a Super User on the system, which since this is a dev installation should be your account's password. 

Launch a terminal (shortcut being CTRL+ALT+T) and run the following commands in order.

Install the PPA Repository by typing 
sudo add-apt-repository ppa:webupd8team/java
in your terminal and pressing "enter" when prompted if you want to add the PPA to your system.

Once the PPA has been added, run
sudo apt-get update
to update the package manager.

Once the package manager has been updated run
sudo apt-get install oracle-java7-installer 
and select "Y" and press enter when asked if you want to continue.

You will be prompted to accept the "Oracle Binary Code License Agreement" so just hit "enter" when you are prompted with the below screen.

On the next screen you will be accepting the license agreement so just select "yes" and hit enter.

 You should now see the Oracle Java installation proceeding with the downloading of the files.

Once the installation is complete you will be returned to the terminal prompt.

We are almost done with the installation of Java. The only thing that remains is to set the environment variables. Run the following command in the terminal.
sudo apt-get install oracle-java7-set-default
This will install a package that sets your Java environment variables.

Once all of this has been completed you can run the following commands to check that you are now running Oracle Java 7 on your machine.
java -version 

javac -version

Both of these commands should give you output similar to the screenshots. (At the time of writing, Java 7 Update 40 was the latest version of Java 7.)
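If you want to capture the version in a script rather than read it off the screen, keep in mind that "java -version" writes to standard error rather than standard output. A minimal sketch of such a check follows. The variable name java_version is only illustrative.

```shell
# Report the first line of the Java version banner, or a hint if java is absent.
if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr, so redirect it to stdout to capture it.
  java_version=$(java -version 2>&1 | head -n 1)
else
  java_version="java not found on PATH"
fi
echo "$java_version"
```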

Installing Hadoop

Our next step is to install Hadoop. Most books and guides normally install Hadoop to the folder "/usr/local/hadoop" and then create a symlink for Hadoop in the "/opt" folder. I find this to be a massive pain if you are simply trying to get an instance of Hadoop up and running for you to experiment with. The biggest pain has to do with permissions, and all the extra permission settings that you need to configure to allow those directories to be used.

To make it easier for you to get up and running with Hadoop quickly, we will install Hadoop to a folder in the "home" folder of the user that we are logged into the machine with. Keep in mind that this is NOT best practice and should only be used to get an instance up and running for your own personal use.

First we need to create an "apps" directory in our "home" directory. We can do this either by using the terminal or we could use the GUI. Since we have spent a lot of time using the Terminal so far, and most people always complain that Linux seems to require terminal use for everything, we will opt to go the GUI route to add a bit of variety to our installation. 

On the Ubuntu desktop launch "Nautilus", the file manager equivalent of "Windows Explorer".

This should take you straight to your home folder where you should "Right click" and select "Create New Folder".

Call the new folder "apps" (lower case, remember that all folders and paths in Linux are case sensitive). Once you are done you should have something that looks like the screenshot below.

Next we need to extract our Hadoop tarball into the "apps" directory. Double click on "Downloads" to go into the "Downloads" directory. 

In Downloads double click on the "hadoop-1.2.1-bin.tar.gz" file to open the archive manager.

In the Archive Manager click on Extract, then navigate to the "apps" directory in your home folder and extract the content of the tarball there.

Once the files have been extracted click on "Quit" when prompted by the Archive Manager.
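For readers who prefer the terminal after all, the folder creation and extraction steps above can be sketched as follows. This assumes the tarball was saved to the "Downloads" folder under the filename mentioned earlier. Adjust the tarball variable if yours differs.

```shell
# Assumed download location and filename from the steps above.
tarball="$HOME/Downloads/hadoop-1.2.1-bin.tar.gz"

# Create the apps folder in the home directory (no error if it already exists).
mkdir -p "$HOME/apps"

# Extract the tarball into apps, or say so if the file is not where we expect it.
if [ -f "$tarball" ]; then
  tar -xzf "$tarball" -C "$HOME/apps"
else
  echo "Tarball not found at $tarball" >&2
fi
```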

We now have Hadoop extracted to our machine, but before we configure it we need to ensure that we set up SSH to allow access that does not require a password. We will get to this next.

Configuring SSH

Hadoop uses SSH for its processes to communicate with each other on one or more machines, and we need to ensure that the user we are installing Hadoop with can access all these machines through SSH without a password. We do this by creating an SSH Key Pair with an empty passphrase.

Before we create the SSH Key Pair we need to ensure that the SSH server is installed on our machine. Open the terminal (CTRL+ALT+T or click on the Terminal Icon). In the terminal type the following command.
sudo apt-get install openssh-server
Provide your password when prompted and select "Y" if prompted to install the openssh server to your machine. If OpenSSH Server is already installed on your machine you will be notified of it and you will not need to do anything extra.

Now we can create the SSH Key Pair. In the terminal type the following command.
ssh-keygen -t rsa
When prompted which folder to store the file in just keep the default by hitting "Enter".

When prompted for the passphrase leave it empty by just hitting "Enter" twice. (the second enter is to confirm the empty passphrase)

You should now be returned to the terminal prompt with the file created and the key's random image displayed.

We are almost done with setting things up for SSH. All that is left to do is to copy the new public key to the list of authorised keys. We do this by copying the ".ssh/id_rsa.pub" file created in the previous step to the ".ssh/authorized_keys" file. (UK English users, be careful NOT to spell authorised with an "S". It must be with a "Z" as in US English.)

Type the following command into your terminal.
cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
Once this is done we can test that all is working by SSH-ing to our local machine. We should not be prompted for a password and should simply be connected via SSH. If you are prompted for a password, ensure that you have copied the public key to "authorized_keys", spelled with a "Z", in the previous step. Run the following command in your terminal.
ssh localhost
When prompted if you are sure you want to continue connecting due to the authenticity of host "localhost" not being established type "yes" and enter.

You should then be logged into the system without any prompts for a username and password.
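If "ssh localhost" still asks for a password even though the key has been copied, one common cause is file permissions: with its default StrictModes setting, the OpenSSH server ignores keys when the ".ssh" directory or the authorised keys file is writable by other users. A minimal sketch of tightening them, assuming the default key locations in your home folder:

```shell
# Make sure the directory and file exist (touch leaves existing content alone).
mkdir -p "$HOME/.ssh"
touch "$HOME/.ssh/authorized_keys"

# Only the owner may read, write or enter the .ssh directory.
chmod 700 "$HOME/.ssh"

# Only the owner may read or write the list of authorised keys.
chmod 600 "$HOME/.ssh/authorized_keys"

# Show the resulting permission bits for a quick visual check.
stat -c '%a %n' "$HOME/.ssh" "$HOME/.ssh/authorized_keys"
```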

Close the terminal.

Configuring Hadoop

We are now ready to configure our Hadoop installation to run in pseudo-distributed mode. This is done by changing settings or adding settings in the XML configuration files of Hadoop.

First we need to create the directory in which Hadoop will store its temporary and data files (this is the location we will assign to the "hadoop.tmp.dir" setting). We will use the "/var/lib/hadoop" directory for this purpose. Open a terminal and type the following command into it. (Provide your password when prompted.)
sudo mkdir /var/lib/hadoop
This will create the directory. Next we need to make this folder writeable by any user. Type the following command into the terminal.
sudo chmod 777 /var/lib/hadoop

Next we can modify the "core-site.xml" file in the "apps/hadoop-1.2.1/conf" directory of our home folder. This can be done by opening Nautilus, which we used when we extracted the Hadoop tarball, navigating to the folder, right clicking on the file and selecting "Open With Text Editor".

In the file we need to add the following between the <configuration> and </configuration> tags.

<property>
  <name>hadoop.tmp.dir</name>
  <value>/var/lib/hadoop</value>
</property>

Once this has been added the file should look like the screenshot below.

Save and close the file.
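The same file can also be written from the terminal. The sketch below overwrites "core-site.xml" with the "hadoop.tmp.dir" setting from this guide. It also sets "fs.default.name", which pseudo-distributed setups of Hadoop 1.x normally need. The "hdfs://localhost:9000" address is an assumed, commonly used value rather than something taken from this guide, and the install path is assumed to be the one used above.

```shell
# Assumed install location from this guide; set HADOOP_CONF_DIR to override it.
conf_dir="${HADOOP_CONF_DIR:-$HOME/apps/hadoop-1.2.1/conf}"
mkdir -p "$conf_dir"

# Overwrites core-site.xml. hadoop.tmp.dir comes from this guide;
# the fs.default.name address is an assumption (a common local default).
cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/lib/hadoop</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
```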

Next we need to modify the "hdfs-site.xml" file. Add the following lines into it between the <configuration> and </configuration> tags.

Save and close the file.
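On a single machine there is only one DataNode, so the usual HDFS setting for pseudo-distributed mode is a replication factor of 1. A minimal sketch of writing "hdfs-site.xml" from the terminal, assuming the standard Hadoop 1.x property name "dfs.replication" and the install path used in this guide:

```shell
# Assumed install location from this guide; set HADOOP_CONF_DIR to override it.
conf_dir="${HADOOP_CONF_DIR:-$HOME/apps/hadoop-1.2.1/conf}"
mkdir -p "$conf_dir"

# Overwrites hdfs-site.xml. A replication factor of 1 is the assumed,
# typical value for a single-node setup.
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```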

Our last modification is to the "mapred-site.xml" file. Add the following lines into it between the <configuration> and </configuration> tags.

Save and close the file.
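For MapReduce in Hadoop 1.x, the setting that normally matters in pseudo-distributed mode is the JobTracker address, "mapred.job.tracker". The sketch below writes "mapred-site.xml" from the terminal. The "localhost:9001" address is an assumed, commonly used value rather than something taken from this guide.

```shell
# Assumed install location from this guide; set HADOOP_CONF_DIR to override it.
conf_dir="${HADOOP_CONF_DIR:-$HOME/apps/hadoop-1.2.1/conf}"
mkdir -p "$conf_dir"

# Overwrites mapred-site.xml. The JobTracker host and port are assumptions.
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```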

Finally we need to set the JAVA_HOME property in the "hadoop-env.sh" file. Open the file and uncomment the line reading "# export JAVA_HOME=/usr/lib/j2sdk1.5-sun" by removing the "#" in front of it.

Change this to be "export JAVA_HOME=/usr/lib/jvm/java-7-oracle" to point to the location of our Java 7 installation.

Save and close the file.
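The same edit can be made with sed instead of a text editor. The sketch below tries the substitution on a scratch file that contains only the stock commented-out line, so nothing in your installation is touched. The scratch file and the variable name demo are purely illustrative.

```shell
# Scratch copy holding the stock line, so the edit can be tried safely.
demo=$(mktemp)
echo '# export JAVA_HOME=/usr/lib/j2sdk1.5-sun' > "$demo"

# Uncomment the JAVA_HOME line and point it at the Oracle Java 7 location.
sed -i 's|^# *export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/java-7-oracle|' "$demo"

# Show the result of the edit.
cat "$demo"
```

To apply it for real, run the same sed command against the "hadoop-env.sh" file in the Hadoop "conf" directory instead of the scratch file.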

We are now ready to start using Hadoop. We will make extensive use of executables that exist in our "apps/hadoop-1.2.1/bin" directory, and to save us some typing on the command line it is best to add the path to these executables to the environment variables of the shell. We can do this either by typing the commands to add the environment variables each time we open the terminal, or by creating a ".sh" shell script that we can run from the terminal before using Hadoop. I suggest creating the shell script as it makes life much easier.

In Nautilus right click in your "home" directory and create an empty document.

Call the file something like "hadoopsetenv.sh". Make sure you give it the ".sh" extension. Double click the file and add the following text to it.
export HADOOP_PREFIX=/home/hadoop/apps/hadoop-1.2.1
export PATH=$HADOOP_PREFIX/bin:$PATH
Replace the "/hadoop/" section in "/home/hadoop/apps/hadoop-1.2.1/" with the name of the user that you are using.

Save and close the file.

Now we can run the file every time that we go to the terminal wanting to work with Hadoop and we will not need to type out the full path for our executables.
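A small variation on the script, sketched under the assumption that Hadoop was extracted to "apps/hadoop-1.2.1" in your home folder, uses $HOME so that the username never needs editing:

```shell
# Location of the Hadoop installation, relative to the current user's home folder.
export HADOOP_PREFIX="$HOME/apps/hadoop-1.2.1"

# Put the Hadoop executables at the front of the search path.
export PATH="$HADOOP_PREFIX/bin:$PATH"
```

Because the script only changes variables of the shell that reads it, it has to be loaded with "source" (or ".") rather than executed as a separate program.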

Next we need to format our Hadoop "namenode". Launch the terminal and type the following commands.
source hadoopsetenv.sh
hadoop namenode -format
You should see output similar to the screenshot below when the namenode format has completed.

Now we are ready to start Hadoop. Type the following command in the terminal to start the components necessary for HDFS.
start-dfs.sh
Your output should look similar to the screenshot below.

Next we need to start the components necessary for MapReduce. Type the following into the terminal.
start-mapred.sh
Your output should look similar to the screenshot below.

Finally we need to check that all the Java processes we have started are running. Type the following into the terminal.
jps
Your output should look similar to the screenshot below.
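For Hadoop 1.x in pseudo-distributed mode, the process listing normally contains five daemons: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. A minimal sketch of checking for them automatically follows. The function name missing_daemons is only illustrative, and the check assumes the JDK's "jps" tool prints one "pid name" pair per line.

```shell
# Daemons a Hadoop 1.x pseudo-distributed node is expected to run.
expected="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# Print each expected daemon that is absent from the given process listing.
missing_daemons() {
  for daemon in $expected; do
    # Match the name as the whole last field, so that "NameNode" does not
    # accidentally match the "SecondaryNameNode" line.
    printf '%s\n' "$1" | grep -q " ${daemon}\$" || echo "$daemon"
  done
}

# Use the real listing when jps is available; otherwise check an empty one.
jps_output=$(command -v jps >/dev/null 2>&1 && jps || true)
missing_daemons "$jps_output"
```

If the function prints nothing, all five daemons were found. Any name it prints points at a component whose log files are worth a look.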

Hadoop is now up and running on your machine.


I hope this guide will be of help to anyone who wants to set up an instance of Hadoop to experiment with, especially people more familiar with a Windows environment.