Install and Configure Cloudera

Why Yet Another Install and Configuration for Cloudera

Okay, so the first question you might be asking is why we need yet more documentation on how to Install and Configure Cloudera.  Well, the answer is simple: all of the other documentation is either too detailed or too scant.  It is very hard to find something that can get you started with just enough information to do what you need in a minimum of time.  The Cloudera documentation is of course the definitive source, but unfortunately there is a lot of it, and you can spend a good chunk of time sorting through it all.  I know this from first-hand experience!  So what I have done is put together the checklist that I use when doing a CDH installation.  I hope you find it helpful.

Installing Cloudera

Prerequisites

The first step to Install and Configure Cloudera is to install it.  Before that, of course, there are some prerequisite configuration steps on the nodes that will be used in the cluster.

1) Make sure SELinux is disabled on all nodes in the cluster

su -
# edit /etc/sysconfig/selinux and set the line below, then reboot
SELINUX=disabled
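
If you would rather script this edit than make it by hand, it can be sketched as a small shell function. This is only a sketch: the name disable_selinux is my own, it assumes GNU sed (for --follow-symlinks, since /etc/sysconfig/selinux is commonly a symlink), and a reboot is still needed before the change takes full effect.

```shell
# Sketch: set SELINUX=disabled in an SELinux config file.
# disable_selinux is a hypothetical helper name, not a system command.
disable_selinux() {
    conf="${1:-/etc/sysconfig/selinux}"
    # Rewrite only the SELINUX= line; SELINUXTYPE= is left untouched.
    # --follow-symlinks (GNU sed) edits the link target instead of replacing the link.
    sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/' "$conf"
    # Print the resulting line so the change can be eyeballed.
    grep '^SELINUX=' "$conf"
}
```

Run as root on each node, calling disable_selinux with no argument edits the default file.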

2) Make sure iptables is turned off on all nodes in the cluster

su -
service iptables stop
chkconfig iptables off

3) Make sure sshd is running on all the nodes

su -
service sshd status
chkconfig --list | grep sshd

If sshd does not have levels 2, 3, 4 and 5 turned on, then turn them on with

chkconfig --level 2345 sshd on

Then start it and make sure it comes up at boot

service sshd start
chkconfig sshd on
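
The runlevel check above is easy to get wrong by eye, so here is a rough sketch of automating it. The function name levels_2345_on is made up for illustration, and it assumes the usual chkconfig output format with entries such as 2:on and 3:off.

```shell
# Sketch: succeed only if a `chkconfig --list` line shows runlevels 2-5 on.
# levels_2345_on is a hypothetical helper name.
levels_2345_on() {
    line="$1"
    for lvl in 2 3 4 5; do
        case "$line" in
            *"$lvl:on"*) ;;        # this runlevel is enabled, keep checking
            *) return 1 ;;         # runlevel is off or missing
        esac
    done
    return 0
}

# Usage on a node (as root):
# levels_2345_on "$(chkconfig --list sshd)" || chkconfig --level 2345 sshd on
```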

4) Ensure hosts are fully qualified

This is only required on a multi-node cluster.  With a single-node cluster, you can get away with using localhost.  However, if you are going to use multiple nodes, then each node needs a fully qualified name that every other node can resolve.

In order to set up the fully qualified hosts, there are really only a few files that need to be considered.  The contents of these will not be discussed here, because that is beyond the scope of this document.

/etc/hosts
/etc/resolv.conf
/etc/sysconfig/network
/etc/sysconfig/network-scripts/<interface-id>
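
While the contents of those files are out of scope here, a quick sanity check on the result is not. The sketch below treats a name as fully qualified if it has a dot with text on both sides. The is_fqdn name is my own, and this is a rough check rather than full hostname validation.

```shell
# Sketch: rough test for a fully qualified host name.
# is_fqdn is a hypothetical helper name.
is_fqdn() {
    case "$1" in
        .*|*.|*..*) return 1 ;;   # leading, trailing, or doubled dots
        *.*) return 0 ;;          # host.domain style name
        *) return 1 ;;            # bare short name such as "localhost"
    esac
}

# Usage on each node:
# is_fqdn "$(hostname -f)" || echo "hostname is not fully qualified" >&2
```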

5) Allow password-less SSH

The master node (the node where the NameNode and JobTracker will be running) needs password-less SSH access to the slave nodes.

su -
ssh-keygen
cd .ssh
cp id_rsa.pub authorized_keys
chmod 640 authorized_keys
scp id_rsa.pub root@slave01:master_key
scp id_rsa.pub root@slave02:master_key
scp id_rsa.pub root@slave03:master_key

Log in to each slave and append the master key to ~/.ssh/authorized_keys

cat master_key >> .ssh/authorized_keys
chmod 640 .ssh/authorized_keys
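
One thing to watch with the append above is that running it twice leaves duplicate entries. A minimal sketch of an idempotent version follows. The add_key name is hypothetical, and it assumes the key file holds a single public key line.

```shell
# Sketch: append a public key to an authorized_keys file only if that
# exact line is not already there. add_key is a hypothetical helper name.
add_key() {
    key_file="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -F fixed string, -x whole-line match, -f read the pattern from key_file
    if ! grep -qxFf "$key_file" "$auth_file"; then
        cat "$key_file" >> "$auth_file"
    fi
    chmod 640 "$auth_file"
}

# Usage on a slave:
# add_key ~/master_key ~/.ssh/authorized_keys
```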

6) Install the Oracle Java 6 SDK on all the nodes in the Cluster

a) Check if OpenJDK is installed and if so uninstall it

su -
rpm -qa | grep jdk
rpm -e [each of above]

b) Install the Java SDK

su -
# download the rpm (e.g. 1.6.0_31, which is the certified version for Java 6)
chmod +x jdk-6u31-linux-x64-rpm.bin
./jdk-6u31-linux-x64-rpm.bin
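
After the install, it is worth confirming which Java each node actually picks up. As a sketch, the function below pulls the version string out of the first line of java -version output. The java_version name is my own, and it assumes the usual first line of the form java version "1.6.0_31".

```shell
# Sketch: extract the quoted version from `java -version` output on stdin.
# java_version is a hypothetical helper name.
java_version() {
    # Only the first line is inspected, e.g.: java version "1.6.0_31"
    sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Usage on a node (java -version writes to stderr, hence 2>&1):
# java -version 2>&1 | java_version
```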

CMF Install

1) Download and install Cloudera Manager

a) Go to the primary server, where you want CM to run

b) Download CMF

curl -O http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin

Alternatively, you can download it with a browser if, for some reason, curl is not desirable.

c) Launch CMF

chmod +x cloudera-manager-installer.bin
./cloudera-manager-installer.bin

Note: Cloudera Manager could fail if not all of the prerequisites have been met.  If this happens, you should be informed of an error condition and given a reference to a log file for more information.

2) If the install proceeds normally, all you need to do is follow the prompts and select the nodes and services.  For a typical system it is only necessary to install the hdfs and mapreduce services.  The other services are not immediately needed and could drain resources, especially in a single-server installation.  Besides, they can be added later if needed.  I would recommend keeping it simple at first and building up as you need it.

3) Upon completion, you will be able to access the CMF console and log in with the default admin/admin credentials

http://localhost:7180
admin/admin
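
The console can take a little while to come up after the installer finishes. A rough sketch of waiting for it from the shell is below. The wait_for_url name, the attempt count and the delay are all my own choices, and it assumes curl is available on the node.

```shell
# Sketch: poll a URL until it answers or the attempts run out.
# wait_for_url is a hypothetical helper; arguments are url, attempts, delay.
wait_for_url() {
    url="$1"
    attempts="${2:-30}"
    delay="${3:-10}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s silent, -f fail on HTTP errors, -o discard the response body
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage:
# wait_for_url http://localhost:7180 && echo "CM console is up"
```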

Configuring Cloudera

The next step to Install and Configure Cloudera is the configuration.  The initial install creates the basic configuration and a bare-bones CDH cluster; however, some additional configuration steps are usually needed.

1) Permissions are always tricky, and can take a little while to sort out.  The difficulty arises when you need to share files between the Hadoop-related processes and your application user.  For example, let's suppose we are going to use a J2EE web application to provide a nice clean interface into our cluster.  The web application would be launched as a particular user and would need to interact with the Hadoop processes, in particular hdfs and mapred.  Of course, the situation could also be reversed, with the Hadoop processes needing access to the application user's files.

First you would need to create your appuser account and make sure it is in the three CDH-generated groups (mapred, hdfs and hadoop).

su -
useradd -U -G mapred,hdfs,hadoop -m -u <user_id> appuser
# enter <some_hard_password> when prompted
passwd appuser

Next, it will be necessary to establish group relationships between the appuser, hadoop, mapred and hdfs users so that they have permissions on each other's files.

Add appuser to the Cloudera-created groups

usermod -G hadoop,hdfs,mapred -a appuser

Allow group access to the appuser home directory

chmod 770 /home/appuser

Add hdfs, mapred and hadoop to the appuser group, if not done at the time of user creation

usermod -G appuser -a hdfs
usermod -G appuser -a mapred
usermod -G appuser -a hadoop
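
Since group mix-ups are the usual cause of permission errors later, a quick membership check can save some head-scratching. The sketch below wraps id -nG. The in_group name is my own invention.

```shell
# Sketch: succeed if a user belongs to a group.
# in_group is a hypothetical helper name.
in_group() {
    user="$1"
    group="$2"
    # id -nG prints the user's group names separated by spaces
    for g in $(id -nG "$user" 2>/dev/null); do
        [ "$g" = "$group" ] && return 0
    done
    return 1
}

# Usage: report any membership the steps above should have created
# for g in hadoop hdfs mapred; do
#     in_group appuser "$g" || echo "appuser is missing group $g"
# done
```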

2) JVM options in Cloudera are a little different from Apache Hadoop.  Instead of putting those options in hadoop-env.sh, they need to be entered via the CM console

For NameNode

Navigate to Select Services -> click on hdfs1 -> Configuration -> View and Edit -> NameNode -> Advanced -> Java Configuration Options for NameNode

For DataNode

Navigate to Select Services -> click on hdfs1 -> Configuration -> View and Edit -> DataNode -> Advanced -> Java Configuration Options for DataNode

For JobTracker

Navigate to Select Services -> click on mapreduce1 -> Configuration -> View and Edit -> JobTracker -> Advanced -> Java Configuration Options for JobTracker

3) Set the supergroup in CM.  The installer creates the default group, hadoop, but does not configure itself to use it.

Navigate to Services -> hdfs1 -> Configuration -> View and Edit -> Service Wide -> Security

Find the Superuser Group property (dfs.permissions.supergroup)

Change the value to hadoop

4) Modify the XML parser options for MapReduce jobs.  This is a common problem for jobs that need to use XML.  The parser needs to be set explicitly or jobs could fail.

Navigate to Services -> mapreduce1 -> Configuration -> View and Edit -> TaskTracker -> Performance

Find the property MapReduce Child Java Opts Base (mapred.child.java.opts)

Change it to -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl

5) Next, you need to deploy the Cloudera configuration.  This will make the cluster configuration available to client applications.  The CDH configuration is internal to CMF, unless it is specifically deployed.

Navigate to All Services -> Cluster

Click on Action at the Cluster level and select Deploy Client Configuration

You may want to double-check /etc/hadoop/conf/hdfs-site.xml and /etc/hadoop/conf/mapred-site.xml to make sure all of your configuration options were created
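
That double-check can be sketched as a small shell function that prints the value of a named property from one of those files. The hadoop_prop name is my own, and the parsing is deliberately simple: it assumes each name and value element sits on a single line, which is how these files are typically laid out, and it is not a real XML parser.

```shell
# Sketch: print the <value> of a named property in a Hadoop *-site.xml file.
# hadoop_prop is a hypothetical helper name; this is line-based, not real XML parsing.
hadoop_prop() {
    file="$1"
    name="$2"
    awk -v want="$name" '
        # Remember when the wanted <name> element has been seen.
        index($0, "<name>" want "</name>") { found = 1 }
        # The first <value> at or after that point belongs to the property.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$file"
}

# Usage:
# hadoop_prop /etc/hadoop/conf/hdfs-site.xml dfs.permissions.supergroup
# hadoop_prop /etc/hadoop/conf/mapred-site.xml mapred.child.java.opts
```

If a call prints nothing, the property did not make it into the deployed file.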