Tuesday, February 8, 2011

Hadoop on Amazon EC2 - Part 4 - Running on a cluster

1) Edit the file conf/hdfs-site.xml
Set the number of replicas to the number of nodes you plan to use; in this example, 4.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/tmp2/</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/mnt/tmp3/</value>
  </property>

  <property>
    <name>dfs.replication</name>
    <value>4</value>
    <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.</description>
  </property>
</configuration>



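On a stock EC2 instance /mnt is usually owned by root, so the directories above may not be creatable by the Hadoop user. If that is the case, a minimal sketch to prepare them on every node (assuming you start Hadoop as the current user):
sudo mkdir -p /mnt/tmp /mnt/tmp2 /mnt/tmp3
sudo chown -R $USER /mnt/tmp /mnt/tmp2 /mnt/tmp3
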
2) Edit the file conf/slaves and list the DNS names of all of the machines you are going to use. For example:
 ec2-67-202-45-10.compute-1.amazonaws.com
 ec2-67-202-45-11.compute-1.amazonaws.com
 ec2-67-202-45-12.compute-1.amazonaws.com
 ec2-67-202-45-13.compute-1.amazonaws.com 
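
Every node needs the same configuration files. One way to push conf/ from the master to all slaves is a small loop over this file (a sketch, assuming Hadoop lives in /usr/local/hadoop-0.20.2 on every machine and that rsync is installed; it will prompt for passwords until the SSH keys from step 6 are in place):
cd /usr/local/hadoop-0.20.2
for host in $(cat conf/slaves); do rsync -av conf/ $host:/usr/local/hadoop-0.20.2/conf/; done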

3) Edit the file conf/masters and enter the DNS name of the master node. For example:
ec2-67-202-45-10.compute-1.amazonaws.com

Note that the master node can also appear in the slaves list.

4) Edit the file conf/core-site.xml to include the master name

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>
</configuration>

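Because fs.default.name and mapred.job.tracker use the master's public DNS name, every slave must be able to resolve that name; within EC2 the public hostname normally resolves to the instance's private address. A quick check from any slave (a sketch):
nslookup ec2-67-202-45-10.compute-1.amazonaws.com
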
5) Edit the file conf/mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ec2-67-202-45-10.compute-1.amazonaws.com:9000</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>ec2-67-202-45-10.compute-1.amazonaws.com:9001</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/mnt/tmp/</value>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value> <!-- about the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>12</value> <!-- slightly more than the number of cores -->
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>12</value> <!-- slightly more than the number of cores -->
  </property>

</configuration>
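
A stray bracket or an unclosed comment in any of these XML files makes the daemons die at startup with a parser error, so it is worth checking that the files are well formed. A sketch, assuming xmllint is installed:
xmllint --noout conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml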
6) Log in to the master node. For each of the three slave machines, copy the master's public DSA key to the slave:
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-11.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-12.compute-1.amazonaws.com
ssh-copy-id -i ~/.ssh/id_dsa.pub ec2-67-202-45-13.compute-1.amazonaws.com
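
If ssh-copy-id complains that ~/.ssh/id_dsa.pub does not exist, generate a key pair first and rerun the commands above; then confirm that passwordless login works to every slave (a sketch, assuming the same user name on all machines):
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
for host in $(cat /usr/local/hadoop-0.20.2/conf/slaves); do ssh $host hostname; done
Each ssh should print the slave's hostname without asking for a password.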

7) To start Hadoop, run the following on the master machine (format the namenode only the first time you bring the cluster up; reformatting it erases HDFS):
/usr/local/hadoop-0.20.2/bin/hadoop namenode -format
/usr/local/hadoop-0.20.2/bin/start-dfs.sh
/usr/local/hadoop-0.20.2/bin/start-mapred.sh
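
To check that the cluster really came up, look at the running Java daemons and the number of live datanodes, and optionally run one of the example jobs bundled with the 0.20.2 tarball (a sketch; jps ships with the JDK):
jps
/usr/local/hadoop-0.20.2/bin/hadoop dfsadmin -report
/usr/local/hadoop-0.20.2/bin/hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar pi 10 1000
On the master, jps should show NameNode, SecondaryNameNode and JobTracker (plus DataNode and TaskTracker if the master is also in conf/slaves), and dfsadmin -report should list 4 live datanodes.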

8) To stop Hadoop
/usr/local/hadoop-0.20.2/bin/stop-mapred.sh
/usr/local/hadoop-0.20.2/bin/stop-dfs.sh
