Configuring Hadoop cluster with multiple hosts in Cloudera CDH4

I have started loving Cloudera 🙂 . So far it is the best administrative tool I have ever used. The deployment tool (Cloudera Manager) is robust and user friendly, and I believe its built-in health checks and diagnostics make troubleshooting your Hadoop cluster much easier.
Thanks to the Cloudera team for their continuous effort in building such a robust system.

After installing my first CDH4 Hadoop distribution on a single node, I decided to add a new host to the cluster. Cloudera Manager is a smart, easy-to-understand web tool that you can use to deploy new hosts in your Hadoop environment.

Current configuration of my Hadoop cluster

Before starting the new host configuration, check whether you can ping the targeted host by hostname; otherwise your installation will fail.

To mitigate this issue, follow the steps below:

1. From a terminal window, enter the command below
sudoedit /etc/hosts
2. An editor will open. Enter your target hostname and IP as below and save
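
For example, if the new host were named smr02 with IP address 192.168.1.102 (both placeholder values here; substitute your own host's details), the entry would look like this:

# map the new host's name to its IP so Cloudera Manager can reach it
192.168.1.102    smr02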

Hadoop cluster configuration

1. Go to the computer on which you have installed CDH4
2. Open a browser and enter the URL http://localhost:7180 to open Cloudera Manager
3. I am still using the default user name and password that CDH4 configures while installing Cloudera Manager (User Name: admin, Password: admin)

4. Click on the Hosts menu in the top bar of the Cloudera Manager portal

5. Click on Add New Hosts to Cluster

6. Click on Continue
7. Enter the IP address of the new host on which you want to install CDH4. You need to install the OpenSSH server on that host (https://help.ubuntu.com/10.04/serverguide/openssh-server.html); otherwise the system will not be able to find the targeted host (see the install command after this list). Click on the Search button

8. Select the hostname from the list and click on Continue

9. Select the latest release of CDH4, leave the other options as they are, and click on Continue

10. Enter the root password of the targeted host and click on Continue

11. The system will start installing on the targeted host

12. Depending on your network speed it will take some time to complete the installation, since it downloads the required packages from the web. You can see the current status by clicking on the Details link


13. After the installation completes, you can see multiple hosts in your Hadoop cluster
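
As mentioned in step 7, Cloudera Manager connects to the new host over SSH, so the target machine must be running an SSH server before you begin. On Ubuntu, installing the OpenSSH server is a single command (run it on the target host):

sudo apt-get install openssh-server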

Add a new instance to a service

1. Go to Services and click on the service to which you want to add an instance on the newly configured host

2. Click on the Instances link and click on Add

3. Select the role you want to assign to the targeted host and click on Continue

4. The role will be added to that service as below

5. Start the service by selecting it and clicking on Start from the Actions menu

How to copy a file to the Hadoop file system

Due to security restrictions, you will see an access denied error message while transferring a file to HDFS from your local file system. You can follow the steps below to copy a file from the local system to the Hadoop file system.

First, check the current directory status
1. Open a terminal
2. Enter the command below

hadoop fs -ls /
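
On a fresh CDH4 installation the output will look roughly like the listing below; the exact directories, owners, and timestamps depend on which services you installed, so treat this only as an illustration:

Found 3 items
drwxr-xr-x   - hbase hbase               0 2013-02-10 11:05 /hbase
drwxrwxrwt   - hdfs  supergroup          0 2013-02-10 11:06 /tmp
drwxr-xr-x   - hdfs  supergroup          0 2013-02-10 11:06 /user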

To create a directory under the user folder, enter the command below (here root is the directory name)

sudo -u hdfs hadoop fs -mkdir /user/root

After creating the directory, assign permission to that directory so that the root user can copy data to the Hadoop file system.
sudo -u hdfs hadoop fs -chown root:root /user/root

If you use the normal hadoop mkdir command without sudo, you may see the permission denied error below.
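
For example, running the mkdir directly as the root user fails, because HDFS checks the requesting user's permissions on the parent /user directory (owned by hdfs), not your local Linux privileges. The error looks roughly like this (the exact text in my case was in a screenshot):

hadoop fs -mkdir /user/root
mkdir: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x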

To copy a file, enter the command below (assuming you have a file named Test.txt in your Documents folder and you want to copy it to the Hadoop /user/root folder)

hadoop fs -copyFromLocal Documents/Test.txt hdfs://localhost/user/root/Test.txt

Here you may see a connection refused error if you are using the Cloudera Hadoop distribution (CDH4)

In this case you should use the hostname instead of localhost, as below (here smr01 is my hostname)

hadoop fs -copyFromLocal Documents/Test.txt hdfs://smr01/user/root/Test.txt

Now you can see that your file has been copied to HDFS
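
To verify, list the target directory; the copied file should appear in the output:

hadoop fs -ls /user/root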

My First Hadoop Installation with Cloudera CDH4

Though it may be a bold step for me to talk about Big Data at this early stage of my learning, I believe my last few days' experience learning Big Data will be helpful for newcomers to this Big Data world.

Big Data is a hot topic now and there are tons of resources available on the Internet, but it is very easy to get confused in the early stage of learning: where to start?

And I am sure that for the millions of Windows geeks like me, it will be a bit confusing to start learning this new topic. Though Microsoft has just started working on this topic with Hortonworks, they are far behind the other Big Data solution providers in the market (especially for solutions in the Windows environment).

Where to start

1. At first you should have some conceptual idea of Big Data technologies like Hadoop and MapReduce; for this reason, Hadoop in Practice and Hadoop: The Definitive Guide are must-read books.
2. Have a 64-bit PC or server. In my case I used Windows 2008 R2 Hyper-V and installed the 64-bit version of Ubuntu as a virtual machine
3. My recommendation is to install Linux in GUI mode in your test environment

Installing Hadoop

Hadoop is an open source project from the Apache Software Foundation, and there are several Hadoop distributions available on the market. You can either install Hadoop manually by downloading it from Apache (which is a very complex process to configure) or use one of the automated Hadoop distributions available in the market.

Which Hadoop distribution to use?
Most of us who have dealt with Windows for a long time and avoided Linux, like me, will be interested in the Hortonworks Hadoop distribution, which has a confusing and still unclear association with Microsoft. Though Microsoft claims they are working on Big Data with SQL Server 2012, I have some doubt whether they have a clear plan yet.

On the other hand, the Cloudera distribution is much better defined, and Cloudera has made remarkable contributions to Hadoop development.
That is why my suggestion is to go with Cloudera.

Steps to follow
1. After installing Linux, go to Cloudera.com and download the Cloudera Manager free edition (Install CDH4 Automatically via Cloudera Manager) along with the installation documentation. Though there is a fully configured VM available, my suggestion is to install using the bin file so that you will at least have some idea of what it is really doing.
2. This distribution is only compatible with 64-bit Linux versions, so 64-bit Linux is a must. You should also have a stable internet connection for the installation
3. After downloading cloudera-manager-installer.bin, open a terminal window and enter the command below. You should read the Cloudera Manager Free Edition Installation Guide; here I have only summarized the important topics.

sudo ./cloudera-manager-installer.bin
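
If the shell responds with a permission error at this point, the downloaded file is most likely not marked executable yet (a common situation with downloaded .bin files). Make it executable and run the installer again:

chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin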

4. After completing the preliminary steps, the system will prompt you as above. Click on Close, then open your browser and enter http://localhost:7180

5. Enter the user name and password (both are admin)

6. Select the host name for the CDH cluster installation and click on Continue

7. The system will start installing as below

8. After successful installation click on Continue

9. The system will show you which services it will install

10. For a single-node cluster, choose the All Services option

11. Review the configuration summary and click on Continue

12. The system will start configuring the services

13. While installing Hive, the system will prompt you for database details. In my case I used the embedded database. Be sure to click on the Test Connection button; otherwise the Continue button will not be enabled.

14. After a successful installation the system will show you the page below

15. As you can see in the image above, my Hive installation failed for some unknown reason. And this is the first time Google failed to suggest to me the reason behind a failure. 😉
What you can do in this situation is:
a) Delete hive1 and its dependent services. You can do this by clicking the Action button available on the right side of the window
b) Click on the Action button associated with Cluster1-CDH4 and click on Add a Service

c) Select the service you want to install and click Continue. The system will download the service installer from the internet and automatically install the service.

Well, it seems my CDH4 cluster is in good health now

Online resources you can read

1. Big Data University
2. Coursera for online free courses
3. Apache Hadoop
4. Learn how to learn Hadoop

This is my learning experience with Hadoop so far; there are many more things yet to learn. Surely I will share my learning progress in the coming days. And by the way, as I mentioned at the beginning of my writing, the recommendations/comments I have expressed here are just my personal thoughts, and any advice or correction will be very much appreciated.