Why Yet Another Install and Configuration for Cloudera
Okay, so the first question you might be wondering is: why do we need yet more documentation on how to Install and Configure Cloudera? The answer is simple: all of the other documentation is either too detailed or too scant. It is very hard to find something that gets you started with just enough information to do what you need in a minimum of time. The Cloudera documentation is of course the definitive source, but unfortunately there is a lot of it, and you can spend a good chunk of time sorting through it all. I know this from first-hand experience! So what I have done is put together the checklist that I use when doing a CDH installation. I hope you find it helpful.
The first step to Install and Configure Cloudera is, of course, to install it. Before doing so, there are some prerequisite configuration steps on the nodes that will be used in the cluster.
1) Make sure SELinux is disabled on all nodes in the cluster
setenforce 0
Also set SELINUX=disabled in /etc/selinux/config so that it stays disabled after a reboot.
2) Make sure iptables is turned off on all nodes in the cluster
service iptables stop
chkconfig iptables off
3) Make sure sshd is running on all the nodes
service sshd status
chkconfig --list | grep sshd
If sshd does not have levels 2,3,4,5 turned on, then turn them on with
chkconfig --level 2345 sshd on
Then restart it and make sure it comes up at boot
service sshd start
chkconfig sshd on
4) Ensure hosts are fully qualified
This is only required on a multi-node cluster. With a single-node cluster, you can get away with using localhost. However, if you are going to use multiple nodes, then each of the nodes needs to be fully qualified and resolvable by every other node.
In order to set up the fully qualified hosts, there are really only a few files that need to be considered: typically /etc/hosts and, on RHEL/CentOS systems, /etc/sysconfig/network. Their full contents will not be discussed here, because that is beyond the scope of this document.
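As a rough sketch of what "fully qualified" means here, the /etc/hosts file on each node might carry entries like the following (the hostnames and addresses below are made-up examples, matching the slave01/slave02/slave03 names used later in this checklist):

```
192.168.1.10   master.example.com    master
192.168.1.11   slave01.example.com   slave01
192.168.1.12   slave02.example.com   slave02
192.168.1.13   slave03.example.com   slave03
```

A quick sanity check is that hostname -f returns the fully qualified name on every node.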
5) Allow password-less SSH
It will be necessary to allow the master node (the node where the NameNode and JobTracker will be running) to have password-less SSH to the other slave nodes.
On the master node, generate an RSA key pair if one does not already exist, then distribute the public key:
cd ~/.ssh
ssh-keygen -t rsa
cp id_rsa.pub authorized_keys
chmod 640 authorized_keys
scp id_rsa.pub root@slave01:master_key
scp id_rsa.pub root@slave02:master_key
scp id_rsa.pub root@slave03:master_key
Then log in to each slave and append the master's key to ~/.ssh/authorized_keys
cat master_key >> .ssh/authorized_keys
chmod 640 .ssh/authorized_keys
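With more than a few slaves, repeating the scp line by hand gets tedious. As a small sketch, a loop can generate the distribution commands (the hostnames are the same assumed slave01/slave02/slave03 names used above; run or pipe the output to sh on a real master node):

```shell
# Print one key-distribution command per slave; this only echoes the
# commands, it does not run them.
for host in slave01 slave02 slave03; do
  echo "scp ~/.ssh/id_rsa.pub root@${host}:master_key"
done
```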
6) Install the Oracle Java 6 SDK on all the nodes in the Cluster
a) Check if OpenJDK is installed and if so uninstall it
rpm -qa | grep jdk
rpm -e <each package listed above>
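If more than one OpenJDK package shows up, the uninstall can be scripted. This sketch only prints the rpm -e commands; the package names are sample output I made up for illustration, not from a live system (on a real node you would replace the sample list with $(rpm -qa | grep jdk)):

```shell
# Sample output of: rpm -qa | grep jdk  (assumed package names)
installed="java-1.6.0-openjdk java-1.6.0-openjdk-devel"

# Print an 'rpm -e' command for each package found
for pkg in $installed; do
  echo "rpm -e $pkg"
done
```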
b) Install the Java SDK
Download the RPM installer (i.e., 1.6.0_31, which is the certified version for Java 6), then make it executable and run it
chmod +x jdk-6u31-linux-x64-rpm.bin
./jdk-6u31-linux-x64-rpm.bin
1) Download and install Cloudera Manager
a) Go to the primary server, where you want CM to run
b) Download CMF with curl (the URL below is where Cloudera has historically hosted the latest installer; check the Cloudera downloads page if it has moved)
curl -O http://archive.cloudera.com/cmf/installer/latest/cloudera-manager-installer.bin
Alternatively, this could be downloaded via a browser as well, if for some reason curl is not desirable.
c) Launch CMF
chmod +x cloudera-manager-installer.bin
./cloudera-manager-installer.bin
Note: the Cloudera Manager install could fail if not all of the prerequisites have been met. If this happens, you should be informed of the error condition and pointed to a log file for more information.
2) If the install proceeds normally, all you will need to do is follow the prompts and select the nodes and services. For the typical system it will only be necessary to install the hdfs and mapreduce services. The other services are not immediately needed and could drain resources, especially in a single-server installation. Besides, the other services can be added later, if needed. I would recommend keeping it simple at first and building up as you need it.
3) Upon completion, you will then be able to access the CMF console in a browser on the Cloudera Manager host at port 7180 (the default)
The next step to Install and Configure Cloudera is the configuration. The initial install creates the basic configuration and a bare-bones CDH cluster; however, some additional configuration steps are usually needed.
1) Permissions are always tricky, and can take a little while to sort out. The difficulty arises when you need to share files between the Hadoop-related processes and your application user. For example, let's suppose we are going to use a J2EE web application to provide a nice clean interface into our cluster. The web application would be launched as a particular user, and it will need to interact with the Hadoop processes, in particular hdfs and mapred. Of course, the situation could just as easily be reversed, with the Hadoop processes needing access to the application user's files.
First you would need to create your appuser account and make sure it is in the three CDH generated groups (mapred, hdfs and hadoop).
useradd -U -G mapred,hdfs,hadoop -m -u <user_id> appuser
Next, it will be necessary to establish group relationships between the appuser, hadoop, mapred and hdfs users so that these users can share files with one another
add appuser to the Cloudera created groups
usermod -G hadoop,hdfs,mapred -a appuser
allow for group access to the appuser home directory
chmod 770 /home/appuser
add hdfs, mapred and hadoop to the appuser group if not done at time of user create
usermod -G appuser -a hdfs
usermod -G appuser -a mapred
usermod -G appuser -a hadoop
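A quick way to sanity-check the group wiring above is to compare the output of id -nG <user> against the groups you expect. This sketch runs that comparison against sample output (the "actual" string is an assumed example of what id -nG appuser would print on a correctly configured node; in real use, substitute the command itself):

```shell
# Groups appuser must belong to, per the steps above
required="mapred hdfs hadoop"

# Sample output of: id -nG appuser  (assumed, for illustration)
actual="appuser hadoop hdfs mapred"

# Collect any required group that does not appear in the actual list
missing=""
for g in $required; do
  case " $actual " in
    *" $g "*) ;;                   # group present
    *) missing="$missing $g" ;;    # group absent
  esac
done

if [ -z "$missing" ]; then
  echo "all required groups present"
else
  echo "missing groups:$missing"
fi
```

The same loop works for checking that hdfs, mapred and hadoop are in the appuser group.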
2) JVM options in Cloudera are a little different from Apache Hadoop. Instead of putting those options in hadoop-env.sh, they will need to be entered via the CM console
Navigate to Select Services -> click on hdfs1 -> Configuration->View And Edit ->NameNode -> Advanced -> Java Configuration Options for NameNode
Navigate to Select Services -> click on hdfs1 -> Configuration->View And Edit ->DataNode -> Advanced -> Java Configuration Options for DataNode
Navigate to Select Services -> click on mapreduce1 -> Configuration->View And Edit ->JobTracker -> Advanced -> Java Configuration Options for JobTracker
3) Set the supergroup in CM. The installer creates the default group, hadoop, but does not configure itself to use it.
Navigate to Services->hdfs1->Configuration->View and Edit->Service Wide->Security
Find the Superuser Group property (dfs.permissions.supergroup)
Change the value to hadoop
4) Modify the XML parser options for MapReduce jobs. Jobs that process XML commonly fail with the default parser factory, so this needs to be set explicitly.
Navigate to Services->mapreduce1->Configuration->View and Edit->TaskTracker->Performance
Find the property MapReduce Child Java Opts Base (mapred.child.java.opts)
change it to -Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl
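For reference, once the client configuration is deployed, this setting surfaces in /etc/hadoop/conf/mapred-site.xml as a property along these lines (your value may also carry heap settings such as -Xmx alongside the -D flag):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Djavax.xml.parsers.DocumentBuilderFactory=com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderFactoryImpl</value>
</property>
```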
5) Next, you need to deploy the Cloudera configuration. This will make the cluster configuration available to client applications. The CDH configuration is internal to CMF, unless it is specifically deployed.
Click on Action at Cluster level and select Deploy Client Configuration
You may want to double-check /etc/hadoop/conf/hdfs-site.xml and /etc/hadoop/conf/mapred-site.xml to make sure all of your configuration options were created
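For example, after the deploy, /etc/hadoop/conf/hdfs-site.xml should contain the supergroup setting from step 3, roughly as follows:

```xml
<property>
  <name>dfs.permissions.supergroup</name>
  <value>hadoop</value>
</property>
```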