Tuesday, October 25, 2011

Hadoop HDFS and HBase Configuration

We assume that you have theoretical knowledge of Hadoop, HDFS, HBase and ZooKeeper. This document provides the basic configuration for HDFS, HBase and ZooKeeper.

Software Requirements for Hadoop/HBase:
1. Java 1.6.x
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
3. Hadoop
4. HBase
5. ZooKeeper

Machine Descriptions:
hbasemaster: HBase Master
nameNode: NameNode

RS1: Region Server for HBase
RS2: Region Server for HBase
RS3: Region Server for HBase

zk1: ZooKeeper Quorum
zk2: ZooKeeper Quorum
zk3: ZooKeeper Quorum

dt1: DataNode and TaskTracker
dt2: DataNode and TaskTracker
dt3: DataNode and TaskTracker
dt4: DataNode and TaskTracker

There is no JobTracker in this list. We will set one up when we need to run MapReduce jobs; until then a JobTracker is not required.

Hadoop Configuration
Unpack the Hadoop distribution into /home/hadoop/softwares, which gives /home/hadoop/softwares/hadoop-0.20.1/

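In conf/hadoop-env.sh of hadoop-0.20.1, set JAVA_HOME to the Java installation directory, e.g. "/opt/jdk1.6.0_06".
Administrators can configure individual daemons using the configuration options HADOOP_*_OPTS. The available options are shown below in the table.

Daemon | Configuration option
NameNode | HADOOP_NAMENODE_OPTS
DataNode | HADOOP_DATANODE_OPTS
SecondaryNamenode | HADOOP_SECONDARYNAMENODE_OPTS
JobTracker | HADOOP_JOBTRACKER_OPTS
TaskTracker | HADOOP_TASKTRACKER_OPTS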
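
As an illustration, the relevant lines of conf/hadoop-env.sh might look like the sketch below. The JAVA_HOME path is the one used in this document; the heap size and the NameNode JVM flag are assumed example values, not settings from the original cluster.

```shell
# conf/hadoop-env.sh (excerpt)

# Java installation used by all Hadoop daemons (path from this document).
export JAVA_HOME=/opt/jdk1.6.0_06

# Assumed example value: maximum heap size, in MB, for each Hadoop daemon.
export HADOOP_HEAPSIZE=1000

# Assumed example value: extra JVM options applied to the NameNode only.
# Any options already present in the environment are preserved.
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS:-}"
```

The scripts in bin/ source this file on every machine, so it has to be kept identical across the cluster.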


Folder structure: (/home/hadoop/hdfs)
For data node: /home/hadoop/hdfs/data
For name node: /home/hadoop/hdfs/name

Note that every machine in the cluster should have a common user named "hadoop" in a group named "supergroup".
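
The folder structure above can be created with a few lines of shell. This is a minimal sketch: the function name make_hdfs_dirs is hypothetical, and it assumes you run it as the hadoop user so that the directories end up owned by that user.

```shell
#!/bin/sh
# Create the local HDFS storage directories under a base directory.
# Usage: make_hdfs_dirs /home/hadoop/hdfs
make_hdfs_dirs() {
    base="$1"
    # name: NameNode metadata, data: DataNode block storage.
    mkdir -p "$base/name" "$base/data"
}

# Example (run as the hadoop user on each machine):
# make_hdfs_dirs /home/hadoop/hdfs
```

Strictly speaking the name directory is only needed on the NameNode and the data directory only on the DataNodes; creating both everywhere keeps the layout uniform.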

ssh Configuration in Hadoop cluster

Step 1: Generate key at server machine
ssh-keygen -t dsa

Respond to the prompts:
• give an empty passphrase (press the return key)
• leave the default file path or give your own

Step 2: Append the public key to the authorized keys
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
(use your own .pub file here if you did not keep the default file path)

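Step 3: Set the permissions of the following directory and files
• ~/.ssh --> 700
• ~/.ssh/authorized_keys and ~/.ssh/*.pub --> 644
• ~/.ssh/id_dsa (the private key) --> 600, because ssh refuses to use a private key that other users can read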

Step 4: Copy the public key from the server to all nodes
ssh-copy-id -i ~/.ssh/id_dsa.pub user@remotehostname
(remotehostname may also be an IP address; enter the remote user's password when prompted)

Step 5: Verify the passwordless login
ssh user@remotehostname
It should not ask for a password.
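
With more than a handful of nodes, the key copy is easier to do in a loop. The sketch below only prints one ssh-copy-id command per host so that they can be reviewed first; the function name print_copy_cmds is hypothetical, and the host names are the DataNode machines listed under Machine Descriptions.

```shell
#!/bin/sh
# Print one ssh-copy-id command per host (dry run, nothing is executed).
# Usage: print_copy_cmds USER PUBLIC_KEY_FILE HOST...
print_copy_cmds() {
    user="$1"
    key="$2"
    shift 2
    for host in "$@"; do
        printf 'ssh-copy-id -i %s %s@%s\n' "$key" "$user" "$host"
    done
}

# Example:
# print_copy_cmds hadoop ~/.ssh/id_dsa.pub dt1 dt2 dt3 dt4
```

Once the printed commands look right, they can be run one by one; each asks for the remote user's password once.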

HDFS Configurations:

Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration - src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
2. Site-specific configuration - conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.

In /home/hadoop/softwares/hadoop-0.20.1/conf, we need to make changes to the core-site.xml and hdfs-site.xml configuration files for Hadoop.

core-site.xml: We need to specify the IP address or domain name of the NameNode.
Parameter | Value | Notes
fs.default.name | URI of the NameNode | hdfs://namenode.XYZ.com:9001/

hdfs-site.xml: We need to specify the local directories used by the NameNode and the DataNodes.
Parameter | Value
dfs.name.dir | /home/hadoop/hdfs/name
dfs.data.dir | /home/hadoop/hdfs/data
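
Written out as XML, the parameters above become property entries inside the configuration element of each file. The following fragments are a sketch that uses only the values given in this document.

```xml
<?xml version="1.0"?>
<!-- conf/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.XYZ.com:9001/</value>
  </property>
</configuration>
```

```xml
<?xml version="1.0"?>
<!-- conf/hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
  </property>
</configuration>
```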


List all slave hostnames or IP addresses in your conf/slaves file, one per line.
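
For this cluster the slaves file would list the four DataNode machines. A minimal sketch for generating it follows; the function name write_slaves is hypothetical, and the short host names dt1 to dt4 are assumed to resolve from the NameNode.

```shell
#!/bin/sh
# Write the given host names, one per line, to CONF_DIR/slaves.
# Usage: write_slaves CONF_DIR HOST...
write_slaves() {
    conf_dir="$1"
    shift
    printf '%s\n' "$@" > "$conf_dir/slaves"
}

# Example:
# write_slaves /home/hadoop/softwares/hadoop-0.20.1/conf dt1 dt2 dt3 dt4
```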

Starting Hadoop

Starting a full Hadoop cluster means starting both HDFS and MapReduce. Since we do not need a JobTracker yet, we start only HDFS here.

Format a new distributed filesystem:
$ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
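
Once HDFS is up, it is worth checking that the NameNode sees its DataNodes and accepts writes. The following is a minimal sketch: the function name hdfs_smoke_test and the /smoke-test path are hypothetical, while dfsadmin -report, dfs -mkdir, dfs -ls and dfs -rmr are regular commands of the hadoop script.

```shell
#!/bin/sh
# Run a few harmless HDFS commands and stop at the first failure.
# Usage: hdfs_smoke_test [PATH_TO_HADOOP_SCRIPT]   (default: bin/hadoop)
hdfs_smoke_test() {
    hadoop="${1:-bin/hadoop}"
    # Report capacity and the DataNodes known to the NameNode.
    "$hadoop" dfsadmin -report &&
    # Create, list and remove a scratch directory (hypothetical path).
    "$hadoop" dfs -mkdir /smoke-test &&
    "$hadoop" dfs -ls / &&
    "$hadoop" dfs -rmr /smoke-test
}

# Example, run from the Hadoop installation directory on the NameNode:
# hdfs_smoke_test
```

If the report lists fewer DataNodes than conf/slaves contains, the DataNode logs on the missing machines are the first place to look.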

Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.

HBASE Configuration

Step 1 # Download the HBase distribution from an Apache mirror. We are using HBase 0.20.3
Step 2 # Extract the distribution to /home/hadoop/softwares/
Step 3 # Rename the extracted folder to hbase
Step 4 # Give the hadoop user full permissions (777) on the hbase directory
Step 5 # Set JAVA_HOME in the conf/hbase-env.sh file, then set the following parameters in conf/hbase-site.xml


Parameter | Value | Description
hbase.rootdir | hdfs://namenode.XYZ.com:9001/hbase | Directory in HDFS where HBase stores its data. The host and port must be those of the NameNode, i.e. the same as in fs.default.name.
hbase.master | hbaseMaster.XYZ.com |
hbase.cluster.distributed * | true | The mode the cluster will be in. true: fully distributed with unmanaged ZooKeeper; false: standalone or pseudo-distributed with managed ZooKeeper.
hbase.zookeeper.quorum * | zk1.XYZ.com,zk2.XYZ.com,zk3.XYZ.com | Comma-separated list of servers in the ZooKeeper quorum. This is the list of servers on which ZooKeeper is started and stopped.
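
As XML, these parameters could be written as in the sketch below. Two points are assumptions about the intended setup rather than statements from the original configuration: hbase.rootdir points at the NameNode's HDFS URI (the fs.default.name value plus /hbase), and the third quorum member is zk3, matching the machine list.

```xml
<?xml version="1.0"?>
<!-- conf/hbase-site.xml -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.XYZ.com:9001/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>hbaseMaster.XYZ.com</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.XYZ.com,zk2.XYZ.com,zk3.XYZ.com</value>
  </property>
</configuration>
```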

Step 6 # Either add the Hadoop conf directory (which contains hdfs-site.xml) to the HBase classpath, or copy hdfs-site.xml from the Hadoop installation directory to the hbase/conf directory. The file should contain the following parameters

Parameter | Value | Description
dfs.data.dir | /home/hadoop/hdfs/data
dfs.name.dir | /home/hadoop/hdfs/name
dfs.namenode.logging.level | all | The logging level for the DFS NameNode. Other values are "dir" (trace namespace mutations), "block" (trace block under/over replications and block creations/deletions), or "all".
dfs.datanode.socket.write.timeout | 0
dfs.datanode.max.xcievers | 2048
dfs.datanode.handler.count | 10

Step 7 # List the region servers (RS1, RS2, RS3) in the regionservers file in hbase/conf, one hostname per line

Step 8 # Give the hadoop user permissions on the hbase directory and the hadoop directory (chmod 755)
Step 9 # As the root user, edit the ~/.bashrc file and append ulimit -c 2048 to the end of the file
Step 10 # Edit /etc/security/limits.conf to include the following two lines
hadoop soft nofile 32768
hadoop hard nofile 32768
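
The new limit only applies to sessions opened after the change, so it is worth checking it from a fresh login as the hadoop user. A minimal sketch, where the function name nofile_ok is hypothetical:

```shell
#!/bin/sh
# Succeed when the open-file limit is at least REQUIRED.
# Usage: nofile_ok REQUIRED [CURRENT]   (CURRENT defaults to `ulimit -n`)
nofile_ok() {
    required="$1"
    current="${2:-$(ulimit -n)}"
    # "unlimited" always satisfies the requirement.
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$required" ]
}

# Example, in a new session of the hadoop user:
# nofile_ok 32768 || echo "open-file limit too low: $(ulimit -n)"
```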

Step 11 # Start HDFS from the NameNode machine (bin/start-dfs.sh)

Step 12 # Start the HBase system from the hbase directory (bin/start-hbase.sh)
Step 13 # Start the HBase shell
bin/hbase shell
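
A quick way to confirm that the master, the region servers and ZooKeeper work together is to create, fill, scan and drop a small table. The sketch below feeds the commands to the shell through a here-document, which assumes that the HBase shell reads commands from standard input; if it does not in your version, type the same commands at the interactive prompt. The function name hbase_smoke_test and the table name smoke_test are hypothetical.

```shell
#!/bin/sh
# Create a scratch table, write one cell, read it back, then drop the table.
# Usage: hbase_smoke_test [PATH_TO_HBASE_SCRIPT]   (default: bin/hbase)
hbase_smoke_test() {
    hbase="${1:-bin/hbase}"
    "$hbase" shell <<'EOF'
create 'smoke_test', 'cf'
put 'smoke_test', 'row1', 'cf:a', 'value1'
scan 'smoke_test'
disable 'smoke_test'
drop 'smoke_test'
exit
EOF
}

# Example, run from the hbase directory:
# hbase_smoke_test
```

A table has to be disabled before it can be dropped, which is why both commands appear.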

Hadoop commands

To list the contents of the HM directory and print a file in it:

bin/hadoop dfs -ls /user/local/input/HM/
bin/hadoop dfs -cat /user/local/input/HM/files.txt

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum); see the Apache MapReduce documentation.
The right level of parallelism for maps seems to be around 10-100 maps per node.
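
As a worked example of the reduce rule of thumb, the sketch below computes both values with integer arithmetic. It assumes the 4 TaskTracker machines of this cluster (dt1 to dt4) and a mapred.tasktracker.reduce.tasks.maximum of 2; the function name reduce_count is hypothetical, and the factor is passed as a percentage (95 or 175) because shell arithmetic has no fractions.

```shell
#!/bin/sh
# Usage: reduce_count NODES REDUCE_SLOTS_PER_NODE FACTOR_PERCENT
# Prints NODES * SLOTS * FACTOR, rounded down to a whole number of reduces.
reduce_count() {
    echo $(( $1 * $2 * $3 / 100 ))
}

reduce_count 4 2 95    # prints 7  (0.95 * 8 = 7.6, rounded down)
reduce_count 4 2 175   # prints 14 (1.75 * 8 = 14)
```

With the lower factor all reduces can start at once as the maps finish; with the higher factor the faster nodes run a second wave of reduces, which balances the load better.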

Setting replication factor for a directory in HDFS

hadoop dfs -setrep -w 3 -R /user/hadoop/dir1

See the Hadoop DistributedCache documentation for a tutorial on Hadoop's Distributed Cache.