Sunday, October 21, 2012

Deploying a GraphLab cluster on EC2

I got the following instructions from my collaborator Jay (Haijie Gu), who spent some time learning Spark cluster deployment and adapted its useful scripts for use with GraphLab.
This tutorial will help you spawn a distributed GraphLab cluster, run an alternating least squares task, collect the results, and shut down the cluster.

Note: the latest version of this tutorial has moved to here: http://graphlab.org/tutorials-2/graphlab-on-ec2-cluster-quick-start/

Step 0: Requirements

1) You should have an Amazon EC2 account eligible to run in the us-east-1a zone.
2) Using the Amazon AWS console, find your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
3) Download your private/public key pair (called here graphlab.pem).
4) Download GraphLab 2.1 using the instructions here.

Step 1: Environment setup

Edit your .bashrc or .bash_profile and add the following lines (remember to source the file after editing):
export AWS_ACCESS_KEY_ID=[ Your access key ]
export AWS_SECRET_ACCESS_KEY=[ Your access key secret ]
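
Since gl-ec2 stops with an error when these variables are missing (see the AWS_ACCESS_KEY_ID error under Troubleshooting), it can help to check them before launching anything. The function below is a minimal sketch; the name check_aws_env is hypothetical and not part of the GraphLab scripts.

```shell
# Hypothetical helper, not part of gl-ec2: verify that both AWS
# credential variables from Step 1 are set and non-empty.
check_aws_env() {
    status=0
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ]; then
        echo "ERROR: The environment variable AWS_ACCESS_KEY_ID must be set" >&2
        status=1
    fi
    if [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "ERROR: The environment variable AWS_SECRET_ACCESS_KEY must be set" >&2
        status=1
    fi
    # Returns 0 when both variables are present, 1 otherwise.
    return "$status"
}
```

After sourcing your .bashrc, running check_aws_env should print nothing and return success; if it prints an error, the exports were not picked up by the current shell.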

Step 2: Start the cluster

$ cd ~/graphlabapi/scripts/ec2
$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey -z us-east-1a -s 1 launch launchtest
(The above command creates a 2-node cluster in the us-east-1a zone: -s is the number of slaves, launch is the action, and launchtest is the name of the cluster. Launching is needed only once, when starting the image.)
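
Every gl-ec2 call in this tutorial repeats the same -i and -k options and the same cluster name, so a small wrapper can cut down on typing and typos. This is only a sketch: the function name gl and the variables GL_IDENTITY, GL_KEYPAIR, GL_CLUSTER and GL_RUN are hypothetical, and the default values are simply the ones used in this post. By default it only prints the command it would run.

```shell
# Hypothetical defaults, copied from the commands in this tutorial.
GL_IDENTITY="${GL_IDENTITY:-$HOME/.ssh/graphlab.pem}"
GL_KEYPAIR="${GL_KEYPAIR:-graphlabkey}"
GL_CLUSTER="${GL_CLUSTER:-launchtest}"

# gl ACTION [OPTIONS...]
# Builds "./gl-ec2 -i KEY -k KEYPAIR OPTIONS ACTION CLUSTER".
# Prints the command unless GL_RUN=1, in which case it is executed.
gl() {
    if [ "$#" -lt 1 ]; then
        echo "usage: gl ACTION [OPTIONS...]" >&2
        return 2
    fi
    action=$1
    shift
    # Rebuild the argument list in the order used throughout this post.
    set -- ./gl-ec2 -i "$GL_IDENTITY" -k "$GL_KEYPAIR" "$@" "$action" "$GL_CLUSTER"
    if [ "${GL_RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}
```

For example, from ~/graphlabapi/scripts/ec2, gl launch -z us-east-1a -s 1 prints the launch command shown above with your key path filled in, and GL_RUN=1 gl login runs the login action.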

Step 2.2: Start Hadoop (mandatory when using HDFS)

This operation is needed when you want to work with HDFS:
$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey start-hadoop launchtest

Step 3: Run alternating least squares demo

This step runs ALS (alternating least squares) on the cluster using a small Netflix subset.
It first downloads the data from the web (http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train and http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate), copies it into HDFS, and runs 5 alternating least squares iterations:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey als_demo launchtest 

After the run is completed, you can log in to the master node and view the output files in the folder ~/graphlabapi/release/toolkits/collaborative_filtering/. The algorithm and exact format are explained here.

Step 4: Shut down the cluster

$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey destroy launchtest

Advanced functionality:

Step 5: Log in to the master node

$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey login launchtest

Step 6: Manual building of GraphLab code

On the master:

cd ~/graphlabapi/release/toolkits
hg pull; hg update
make
# Sync the binary folder to slaves
cd ~/graphlabapi/release/toolkits; ~/graphlabapi/scripts/mpirsync

# Sync the local dependency folder to slaves
cd ~/graphlabapi/deps/local; ~/graphlabapi/scripts/mpirsync

Manual run of ALS demo


Log in to the master node and run:
cd graphlabapi/release/toolkits/collaborative_filtering/ 
mkdir smallnetflix 
cd smallnetflix/ 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.train 
wget http://www.select.cs.cmu.edu/code/graphlab/datasets/smallnetflix_mm.validate 
cd .. 
hadoop fs -copyFromLocal smallnetflix/ / 
mpiexec -n 2 ./als --matrix hdfs://`hostname`/smallnetflix --max_iter=3 --ncpus=1
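
The -n 2 in the mpiexec call matches the 2-node cluster launched with -s 1 (one master plus one slave). Assuming one MPI process per node is what is intended (an inference from this example, not something the scripts document), the command for a different cluster size could be built as in the sketch below. The function name als_command and its arguments are hypothetical.

```shell
# Hypothetical helper: print the ALS command for a cluster with the
# given number of slaves, assuming one MPI process per node.
# als_command SLAVES [MAX_ITER] [NAMENODE_HOST]
als_command() {
    slaves=$1
    max_iter=${2:-3}
    # Default to the local host name, as in the manual run above.
    host=${3:-$(hostname)}
    # Master plus slaves.
    nprocs=$((slaves + 1))
    printf '%s\n' "mpiexec -n $nprocs ./als --matrix hdfs://$host/smallnetflix --max_iter=$max_iter --ncpus=1"
}
```

With one slave and the defaults, this prints the same command as the manual run above; with three slaves it would use -n 4.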

Troubleshooting

Known Errors:
Starting the dfs: namenode running as process 1302. Stop it first. 
localhost: datanode running as process 1435. Stop it first. 
ip-10-4-51-142: secondarynamenode running as process 1568. Stop it first. 
Starting map reduce: jobtracker running as process 1647. Stop it first. 
localhost: tasktracker running as process 1774. Stop it first. 

Solution: Stop Hadoop and restart it using the commands:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey stop-hadoop launchtest

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey start-hadoop launchtest

Error:
12/10/20 13:37:18 INFO ipc.Client: Retrying connect to server: domU-12-31-39-16-86-CC/10.96.133.54:8020. Already tried 0 time(s).

Solution: run jps to check whether one of the Hadoop daemons has failed:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey jps launchtest

> jps
1669 TaskTracker
2087 Jps
1464 SecondaryNameNode
1329 DataNode
1542 JobTracker
In the above example, NameNode is missing (not running). Stop Hadoop using the stop-hadoop command, then start it again with start-hadoop.
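
Spotting the missing daemon by eye gets tedious, so the check can be scripted. The sketch below reads jps output on stdin and prints every expected daemon that is absent. The function name missing_hadoop_daemons is hypothetical, and the list of expected daemons is an assumption taken from the process names that appear in the messages in this post.

```shell
# Hypothetical helper: read jps output on stdin and print each expected
# Hadoop daemon that is not listed, one per line.
missing_hadoop_daemons() {
    # Assumed daemon list, based on the names seen in this tutorial.
    expected="NameNode SecondaryNameNode DataNode JobTracker TaskTracker"
    # The second column of jps output is the process name.
    running=$(awk '{print $2}')
    for daemon in $expected; do
        # -x matches whole lines only, so NameNode is not
        # satisfied by SecondaryNameNode.
        if ! printf '%s\n' "$running" | grep -qx "$daemon"; then
            echo "$daemon"
        fi
    done
}
```

On the master node, jps | missing_hadoop_daemons prints nothing when all five daemons are up. For the listing above it prints NameNode.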

Error:
mpiexec was unable to launch the specified application as it could not access
or execute an executable:

Executable: /home/ubuntu/graphlabapi/release/toolkits/graph_analytics/pagerank
Node: domU-12-31-39-0E-C8-D2

while attempting to start process rank 0.

Solution:
The executable is missing. Run update:

./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey update launchtest

Error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGILL (0x4) at pc=0x000000000056c0be, pid=1638, tid=140316305243104
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [als+0x16c0be]  graphlab::distributed_ingress_base<vertex_data, edge_data>::finalize()+0xe0e
#
# An error report file with more information is saved as:
# /home/ubuntu/graphlabapi/release/toolkits/collaborative_filtering/hs_err_pid1638.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Solution:
1) Update the code:

$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey update launchtest


2) If the problem persists, submit a bug report to the GraphLab users list.


Error:
bickson@thrust:~/graphlab2.1/graphlabapi/scripts/ec2$ ./gl-ec2 -i ~/.ssh/graphlab.pem -k graphlabkey login launchtest
ERROR: The environment variable AWS_ACCESS_KEY_ID must be set

Solution:
Set the environment variables as explained in Step 1.

2 comments:

  1. Hi Danny,

    Thank you for the long-awaited manual. I have a couple of issues here:

    1- You may want to specify that GraphLab should be downloaded from Google Code and not from the available releases, because I checked the latest release (v2.1.4245) and it does not have the "ec2" folder.

    2- When I tried to start the cluster I got this error:

    GraphLab AMI for Standard Instances: ami-d7a418be
    Launching instances...
    Could not find AMI ami-d7a418be

    I think the AMI may have some security setting that prevents other people from accessing it.

    Thanks

    1. Hi Ammar!
      Thanks for trying out my instructions!
      Regarding 1 - we will add this folder ASAP to the build, thanks for catching this!
      Regarding 2 - I have changed the permission to public. Please try again!
