We assume that you have theoretical knowledge of Hadoop, HDFS, HBase and ZooKeeper. This document provides the basic configuration for HDFS, HBase and ZooKeeper.
Software Requirements for Hadoop/HBase:
1. Java 1.6.x
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
3. Hadoop
4. HBase
5. ZooKeeper
Machine Descriptions:
hbasemaster: HBase Master
nameNode: NameNode
RS1: Region Server for HBase
RS2: Region Server for HBase
RS3: Region Server for HBase
zk1: ZooKeeper Quorum
zk2: ZooKeeper Quorum
zk3: ZooKeeper Quorum
dt1: Data Nodes and Task Tracker
dt2: Data Nodes and Task Tracker
dt3: Data Nodes and Task Tracker
dt4: Data Nodes and Task Tracker
No JobTracker is listed here. We will set one up when we need to run MapReduce jobs; until then it is not needed.
Hadoop Configuration
Extract the Hadoop archive into /home/hadoop/softwares, which gives /home/hadoop/softwares/hadoop-0.20.1/
In conf/hadoop-env.sh of hadoop-0.20.1, set JAVA_HOME to "/opt/jdk1.6.0_06".
Administrators can configure individual daemons using the HADOOP_*_OPTS configuration options. The available options are listed in the table below.
Daemon             Option
NameNode           HADOOP_NAMENODE_OPTS
DataNode           HADOOP_DATANODE_OPTS
SecondaryNamenode  HADOOP_SECONDARYNAMENODE_OPTS
JobTracker         HADOOP_JOBTRACKER_OPTS
TaskTracker        HADOOP_TASKTRACKER_OPTS
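As a rough illustration, the relevant lines of conf/hadoop-env.sh might look like the sketch below. The JAVA_HOME path is the one used above; the heap sizes are placeholder values chosen only for this example, not tuned recommendations.

```shell
# Sketch of conf/hadoop-env.sh entries.
# The -Xmx values are illustrative assumptions, not recommendations.
export JAVA_HOME="/opt/jdk1.6.0_06"

# Per-daemon JVM options; each line keeps whatever the variable already held.
export HADOOP_NAMENODE_OPTS="-Xmx1024m ${HADOOP_NAMENODE_OPTS:-}"
export HADOOP_DATANODE_OPTS="-Xmx512m ${HADOOP_DATANODE_OPTS:-}"
```

The other daemons (SecondaryNamenode, JobTracker, TaskTracker) can be given options in the same way through their own variables.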
Folder structure: (/home/hadoop/hdfs)
For data node: /home/hadoop/hdfs/data
For name node: /home/hadoop/hdfs/name
Note that we should have a common user named "hadoop" under a group named "supergroup".
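The folder structure above could be created with a small helper like the one below. This is only a sketch: make_hdfs_dirs is a hypothetical function name, and the groupadd/useradd lines in the comments assume a typical Linux system where you have root access.

```shell
# Sketch: create the name and data directories under a given HDFS root.
# make_hdfs_dirs is a hypothetical helper, not part of Hadoop.
make_hdfs_dirs() {
    root="$1"
    mkdir -p "$root/name" "$root/data"
}

# Assumed one-time setup, run as root on each machine:
#   groupadd supergroup
#   useradd -m -g supergroup hadoop
# Then, as the hadoop user:
#   make_hdfs_dirs /home/hadoop/hdfs
```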
ssh Configuration in Hadoop cluster
Step 1: Generate key at server machine
ssh-keygen -t dsa
Respond to the prompt:
• give empty passphrase (return key)
• leave the default file path or give a different one
Step 2: Import to authorized keys
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
(replace ~/.ssh/id_dsa.pub with your public key file if you chose a different path)
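Step 3: Set the permissions of the following directory and files
• ~/.ssh --> 700
• ~/.ssh/authorized_keys and ~/.ssh/*.pub --> 644
• the private key (e.g. ~/.ssh/id_dsa) --> 600, because ssh refuses a private key that other users can read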
Step 4: Copy public key from server to all nodes
ssh-copy-id -i source-filename user@remotehostname (or IP)
Enter the remote user's password when prompted.
Verification:
ssh destination-host (or IP)
It should not ask for a password.
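Steps 4 and the verification can be repeated for every machine with a loop such as the sketch below. The host names are taken from the machine list above; the key path, the hadoop login name on each node, and the DRY_RUN switch are assumptions made for this example.

```shell
# Sketch: push the public key to every node, then confirm password-less login.
# DRY_RUN is a hypothetical switch; by default the commands are only printed.
NODES="hbasemaster nameNode RS1 RS2 RS3 zk1 zk2 zk3 dt1 dt2 dt3 dt4"
KEY=~/.ssh/id_dsa.pub

distribute_key() {
    for host in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id -i $KEY hadoop@$host"
        else
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            ssh-copy-id -i "$KEY" "hadoop@$host" &&
                ssh -o BatchMode=yes "hadoop@$host" true
        fi
    done
}

distribute_key   # prints the commands; set DRY_RUN=0 to actually run them
```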
HDFS Configurations:
Hadoop configuration is driven by two types of important configuration files:
1. Read-only default configuration - src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
2. Site-specific configuration - conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
In /home/hadoop/softwares/hadoop-0.20.1/conf, we need to change the core-site.xml and hdfs-site.xml configuration files.
core-site.xml: We need to specify the IP address or domain name of the NameNode.
Parameter        Value                          Notes
fs.default.name  hdfs://namenode.XYZ.com:9001/  URI of the NameNode.
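In the actual file this parameter is written as a property element. The sketch below prints a minimal core-site.xml with the value from the table; print_core_site is a hypothetical helper name used only here.

```shell
# Sketch: print a minimal core-site.xml.
# Redirect the output to conf/core-site.xml to use it.
print_core_site() {
    cat <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.XYZ.com:9001/</value>
  </property>
</configuration>
EOF
}

print_core_site
```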
conf/hdfs-site.xml
Parameter Value
dfs.name.dir /home/hadoop/hdfs/name
dfs.data.dir /home/hadoop/hdfs/data
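The two directory settings take the same property form. A minimal sketch, again using a hypothetical helper name (print_hdfs_site):

```shell
# Sketch: print a minimal hdfs-site.xml with the two directories from the table.
# Redirect the output to conf/hdfs-site.xml to use it.
print_hdfs_site() {
    cat <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
  </property>
</configuration>
EOF
}

print_hdfs_site
```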
Slaves
List all slave hostnames or IP addresses in your conf/slaves file, one per line.
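For the machines described earlier, the slaves file would presumably list the four DataNode hosts. A small sketch (print_slaves is a hypothetical helper; use fully qualified names or IP addresses if your network needs them):

```shell
# Sketch: print the DataNode hosts, one per line.
# Redirect the output to conf/slaves on the NameNode to use it.
print_slaves() {
    printf '%s\n' dt1 dt2 dt3 dt4
}

print_slaves
```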
Starting Hadoop
A full Hadoop cluster runs both HDFS and MapReduce. Since this setup has no JobTracker yet, only HDFS is started here.
Format a new distributed filesystem:
$ bin/hadoop namenode -format
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
HBase Configuration
Step 1 # Download the HBase distribution from an Apache mirror. We are using HBase 0.20.3.
Step 2 # Extract the distribution to /home/hadoop/softwares/
Step 3 # Rename the extracted folder to hbase
Step 4 # Give the hadoop user full permissions (777) on the hbase directory
Step 5 # Set JAVA_HOME in the conf/hbase-env.sh file, and set the following properties in conf/hbase-site.xml
hbase-site.xml
Parameter                  Value                                Description
hbase.rootdir              hdfs://namenode.XYZ.com:9001/hbase   HBase root directory in HDFS; it uses the NameNode URI set in fs.default.name.
hbase.master               hbaseMaster.XYZ.com
hbase.cluster.distributed  true                                 The mode the cluster will be in. true: fully distributed with unmanaged ZooKeeper; false: standalone or pseudo-distributed with managed ZooKeeper.
hbase.zookeeper.quorum     zk1.XYZ.com,zk2.XYZ.com,zk3.XYZ.com  Comma-separated list of servers in the ZooKeeper quorum. This is the list of servers on which ZooKeeper is started and stopped.
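Written out as a file, these settings might look like the sketch below. Note that hbase.rootdir here points at the NameNode address from core-site.xml, since the HBase root directory lives in HDFS. print_hbase_site is a hypothetical helper name.

```shell
# Sketch: print an hbase-site.xml with the four properties from the table.
# Redirect the output to hbase/conf/hbase-site.xml to use it.
print_hbase_site() {
    cat <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.XYZ.com:9001/hbase</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>hbaseMaster.XYZ.com</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.XYZ.com,zk2.XYZ.com,zk3.XYZ.com</value>
  </property>
</configuration>
EOF
}

print_hbase_site
```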
Step 6 # Make hdfs-site.xml available to HBase: either add the Hadoop conf directory to the HBase classpath, or copy hdfs-site.xml from the Hadoop installation directory to the hbase/conf directory
hdfs-site.xml
Parameter                          Value                   Description
dfs.data.dir                       /home/hadoop/hdfs/data
dfs.name.dir                       /home/hadoop/hdfs/name
dfs.namenode.logging.level         all                     The logging level for the DFS NameNode. Other values are "dir" (trace namespace mutations), "block" (trace block under/over replications and block creations/deletions), or "all".
dfs.datanode.socket.write.timeout  0
dfs.datanode.max.xcievers          2048
dfs.datanode.handler.count         10
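With this many properties it can be easier to generate the file from a list of name/value pairs. The following is a sketch only; print_hdfs_site_for_hbase is a hypothetical helper, and the pairs are exactly those in the table above.

```shell
# Sketch: build hdfs-site.xml from "name value" pairs, one pair per line.
# Redirect the output to conf/hdfs-site.xml to use it.
print_hdfs_site_for_hbase() {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while read -r name value; do
        printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' \
            "$name" "$value"
    done <<'EOF'
dfs.data.dir /home/hadoop/hdfs/data
dfs.name.dir /home/hadoop/hdfs/name
dfs.namenode.logging.level all
dfs.datanode.socket.write.timeout 0
dfs.datanode.max.xcievers 2048
dfs.datanode.handler.count 10
EOF
    echo '</configuration>'
}

print_hdfs_site_for_hbase
```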
Step 7 # List the region servers in the regionservers file in hbase/conf, one per line
RS1.d2hs.com
RS2.d2hs.com
RS3.d2hs.com
Step 8 # Give the hadoop user permissions on the hbase and hadoop directories (chmod 755)
Step 9 # As root, edit the ~/.bashrc file and append ulimit -c 2048 to the end of the file
Step 10 # Edit /etc/security/limits.conf to include the following two lines
hadoop soft nofile 32768
hadoop hard nofile 32768
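The limits.conf entries take effect for new login sessions, so it is worth checking the limit after logging in again as the hadoop user. A minimal sketch of such a check (nofile_ok and REQUIRED_NOFILE are hypothetical names used only here):

```shell
# Sketch: compare the current open-file limit with the value set in limits.conf.
REQUIRED_NOFILE=32768

# Succeeds when the current limit is "unlimited" or at least the required value.
nofile_ok() {
    current="$1"
    required="$2"
    [ "$current" = "unlimited" ] || [ "$current" -ge "$required" ]
}

if nofile_ok "$(ulimit -n)" "$REQUIRED_NOFILE"; then
    echo "open-file limit is sufficient: $(ulimit -n)"
else
    echo "open-file limit too low: $(ulimit -n) (need $REQUIRED_NOFILE)"
fi
```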
Step 11 # Start HDFS; run this on the NameNode machine
bin/start-dfs.sh
Step 12 # Start HBase from the hbase directory
bin/start-hbase.sh
Step 13 # Start the HBase shell
bin/hbase shell
Hadoop commands
To list the contents of the HM directory and print one of its files:
bin/hadoop dfs -ls /user/local/input/HM/
bin/hadoop dfs -cat /user/local/input/HM/files.txt
The right number of reduces seems to be 0.95 or 1.75 multiplied by (
The right level of parallelism for maps seems to be around 10-100 maps per node.
Setting replication factor for a directory in HDFS
hadoop dfs -setrep -w 3 -R /user/hadoop/dir1
See the Hadoop DistributedCache documentation for a tutorial on Hadoop's Distributed Cache.