Configuring a Hadoop Cluster

Environment Variables

First, set a few environment variables. Configurations are provided below for Bash and tcsh, the two most common shells; use the one that matches the shell you are running.

Bash configuration

Set these environment variables in your .bashrc file. Make sure to update HADOOP_CONF_DIR to point to your Hadoop configuration directory.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=<path to your conf dir>
export YARN_CONF_DIR=${HADOOP_CONF_DIR}
export HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
export YARN_LOG_DIR=/tmp/${USER}/yarn-logs
export JAVA_HOME=/usr/local/jdk1.7.0_75-64
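After adding these lines, run `source ~/.bashrc` (or open a new shell) so they take effect. Hadoop's scripts normally create the log directories on first start, but pre-creating them is a quick way to surface /tmp permission problems early. A minimal sketch, assuming the exports above:

```shell
#!/bin/sh
# Pre-create the per-user log directories named in the exports above.
# This is optional: Hadoop creates them itself, but doing it manually
# confirms that /tmp is writable for this user.
USER=${USER:-$(id -un)}
HADOOP_LOG_DIR=/tmp/${USER}/hadoop-logs
YARN_LOG_DIR=/tmp/${USER}/yarn-logs
mkdir -p "$HADOOP_LOG_DIR" "$YARN_LOG_DIR"
[ -w "$HADOOP_LOG_DIR" ] && echo "log directories ready"
```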

Tcsh/Csh configuration

You need to update your .cshrc (or .tcshrc) file.

setenv HADOOP_HOME /usr/local/hadoop
setenv HADOOP_COMMON_HOME ${HADOOP_HOME}
setenv HADOOP_HDFS_HOME ${HADOOP_HOME}
setenv HADOOP_MAPRED_HOME ${HADOOP_HOME}
setenv YARN_HOME ${HADOOP_HOME}
setenv HADOOP_CONF_DIR <path to your conf dir>
setenv YARN_CONF_DIR ${HADOOP_CONF_DIR}
setenv HADOOP_LOG_DIR /tmp/${USER}/hadoop-logs
setenv YARN_LOG_DIR /tmp/${USER}/yarn-logs
setenv JAVA_HOME /usr/local/jdk1.7.0_75-64

HADOOP_CONF_DIR Contents

core-site.xml

Specify your namenode here. Replace the placeholder text with the appropriate hostname and port pair.

core-site.xml
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://NAME-NODE-HOST:PORT</value>
    </property>
</configuration>
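For example, with a (hypothetical) namenode on host node00 listening on port 9000, the value would be hdfs://node00:9000. A quick shell check, run here against a throwaway copy of the file, extracts the configured URI without needing a running cluster; point the grep at your real ${HADOOP_CONF_DIR}/core-site.xml once it is filled in:

```shell
#!/bin/sh
# Write a demo core-site.xml (hypothetical host/port), then pull the
# configured namenode URI back out with a grep/sed pipeline.
cat > /tmp/core-site.xml <<'EOF'
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://node00:9000</value>
    </property>
</configuration>
EOF
grep -A1 'fs.default.name' /tmp/core-site.xml \
    | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p'
```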

hdfs-site.xml

This file contains HDFS-related configuration. Replace each occurrence of the word PORT with an appropriate port number, and use a different port for each property unless stated otherwise. SECONDARY-NAME-NODE refers to the host listed in the 'masters' file.

hdfs-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///tmp/hadoop-${user.name}</value>
    </property>
 
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>${hadoop.tmp.dir}/dfs/name</value>
    </property>
 
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>${hadoop.tmp.dir}/dfs/data</value>
    </property>
 
    <property>
        <name>dfs.namenode.http-address</name>
        <value>NAME-NODE-HOST:PORT</value>
        <description>Location of the DFS web UI</description>
    </property>
 
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>SECONDARY-NAME-NODE:PORT</value>
        <description>Web UI of the secondary name node</description>
    </property>
 
    <property>
        <name>dfs.datanode.address</name>
        <value>0.0.0.0:PORT</value>
        <description>The address where the datanode server will listen to. If the port is 0 then the server will start
            on a free port.
        </description>
    </property>
 
    <property>
        <name>dfs.datanode.http.address</name>
        <value>0.0.0.0:PORT</value>
        <description>The datanode http server address and port. If the port is 0 then the server will start on a free
            port.
        </description>
    </property>
 
    <property>
        <name>dfs.datanode.ipc.address</name>
        <value>0.0.0.0:PORT</value>
        <description>The datanode ipc server address and port. If the port is 0 then the server will start on a free
            port.
        </description>
    </property>
</configuration>
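Since each PORT placeholder above must get its own unused port, it can help to test a candidate port before writing it into the file. A minimal sketch using Bash's /dev/tcp feature (the port number is an arbitrary example):

```shell
#!/bin/bash
# Return success if nothing appears to be listening on localhost:$1.
# Uses bash's /dev/tcp pseudo-device; not portable to plain sh.
port_free() {
    ! (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

if port_free 50070; then
    echo "50070 looks free"
else
    echo "50070 is taken, pick another port"
fi
```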

yarn-site.xml

This file contains the YARN ResourceManager and NodeManager configuration. As in the previous file, replace each PORT placeholder with an appropriate port number.

yarn-site.xml
<configuration>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>RESOURCE-MANAGER-HOST:PORT</value>
        <description>host is the hostname of the resource manager and
            port is the port on which the NodeManagers contact the Resource Manager.
        </description>
    </property>
 
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>RESOURCE-MANAGER-HOST:PORT</value>
        <description>host is the hostname of the resourcemanager and port is the port
            on which the Applications in the cluster talk to the Resource Manager.
        </description>
    </property>
 
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        <description>In case you do not want to use the default scheduler</description>
    </property>
 
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>RESOURCE-MANAGER-HOST:PORT</value>
        <description>the host is the hostname of the ResourceManager and the port is the port on
            which the clients can talk to the Resource Manager. </description>
    </property>
 
    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/tmp/hadoop-${user.name}/nodemanager/data</value>
        <description>the local directories used by the nodemanager</description>
    </property>
 
    <property>
        <name>yarn.nodemanager.address</name>
        <value>0.0.0.0:PORT</value>
        <description>the nodemanagers bind to this port</description>
    </property>
 
 
    <property>
        <name>yarn.nodemanager.remote-app-log-dir</name>
        <value>/tmp/hadoop-${user.name}/yarn-site-${user.name}/app-logs</value>
        <description>directory on hdfs where the application logs are moved to </description>
    </property>
 
    <property>
        <name>yarn.nodemanager.log-dirs</name>
        <value>/tmp/hadoop-${user.name}/yarn-site-${user.name}/nodemanagerLog</value>
        <description>the directories used by Nodemanagers as log directories</description>
    </property>
 
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        <description>shuffle service that needs to be set for Map Reduce to run </description>
    </property>
 
    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default</value>
    </property>
 
    <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>100</value>
    </property>
 
    <property>
        <name>yarn.nodemanager.localizer.address</name>
        <value>0.0.0.0:PORT</value>
        <description>Address where the localizer IPC is.</description>
    </property>
 
    <property>
        <name>yarn.nodemanager.webapp.address</name>
        <value>0.0.0.0:PORT</value>
        <description>NM Webapp address.</description>
    </property>
 
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>0.0.0.0:PORT</value>
        <description>Address of the ResourceManager web app</description>
      </property>
 
    <!--  Other ports that may conflict with services already running. Uncomment and modify these if you see error messages about the default ports already being in use.
      <property>
       <name>yarn.resourcemanager.address</name>
       <value>0.0.0.0:7472</value>
       <description>The address of the applications manager interface in the RM.</description>
      </property>
 
      <property>
       <name>yarn.resourcemanager.scheduler.address</name>
       <value>0.0.0.0:7430</value>
       <description>The address of the scheduler interface.</description>
      </property>
 
      <property>
       <name>yarn.resourcemanager.resource-tracker.address</name>
       <value>0.0.0.0:7431</value>
      </property>
 
      <property>
       <name>yarn.resourcemanager.admin.address</name>
       <value>0.0.0.0:7433</value>
       <description>The address of the RM admin interface.</description>
      </property>
      -->
</configuration>
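Before starting the daemons, it is worth scanning the conf directory for any PORT placeholders you missed. A sketch (the demo writes to a throwaway directory; set CONF="$HADOOP_CONF_DIR" to check your real files):

```shell
#!/bin/sh
# Scan config files for leftover ':PORT' placeholders.
# CONF is a demo directory here; use CONF="$HADOOP_CONF_DIR" in practice.
CONF=/tmp/demo-conf
mkdir -p "$CONF"
printf '<value>RESOURCE-MANAGER-HOST:PORT</value>\n' > "$CONF/yarn-site.xml"

if grep -rl ':PORT' "$CONF"; then
    echo "placeholders remain, fill them in before starting the cluster"
fi
```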

mapred-site.xml

This file contains configurations for the MapReduce environment.

mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.cluster.temp.dir</name>
        <value>/tmp/${user.name}-tmp</value>
        <final>true</final>
    </property>
 
    <property>
        <name>mapreduce.cluster.local.dir</name>
        <value>${hadoop.tmp.dir}/dfs/data</value>
        <final>true</final>
    </property>
 
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>The runtime framework for executing MapReduce jobs.
            Can be one of local, classic or yarn.
        </description>
    </property>
 
    <property>
        <name>mapreduce.shuffle.port</name>
        <value>PORT</value>
    </property>
</configuration>

masters

This file lists the secondary namenode host. Use a different host from the primary namenode.

masters
SECONDARY_NAME_NODE

slaves

This file lists the nodes to be used as HDFS datanodes, one host per line.

slaves
DATA_NODE_1
DATA_NODE_2
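Hadoop's start scripts ssh into every host listed in this file, so passwordless SSH must work for each one. A sketch that iterates over the file (the demo keeps the placeholder names and just echoes them; on a real cluster, replace the echo with something like `ssh "$host" hostname` to verify SSH access):

```shell
#!/bin/sh
# Iterate over the slaves file. The demo writes placeholder hostnames;
# swap the echo for an actual ssh check against your real slaves file.
printf 'DATA_NODE_1\nDATA_NODE_2\n' > /tmp/slaves
while read -r host; do
    echo "datanode: $host"
done < /tmp/slaves
```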
 
tools/hw3.txt · Last modified: 2015/03/27 16:04 by thilinab