Thursday, July 28, 2011

Identifying structure in large social and information networks

I got the following from Aapo Kyrola:

Regarding partitioning, I recommend reading Stanford's Prof Mahoney's tutorial
about real-world graphs:
Here is a link to the video of the tutorial:

Basically he argues that although social network graphs have local structure (for example, your friend network on Facebook may be nicely clustered), they lack any global structure. This makes at least the current methods of graph partitioning largely ineffective on such graphs.
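A concrete way to quantify "local but not global" structure is conductance, the cluster-quality measure used in Mahoney's network community profile plots: the fraction of a vertex set's edge volume that crosses the cut. Here is a minimal Python sketch on a made-up toy graph (the graph and numbers are illustrative, not from the tutorial):

```python
# Conductance of a vertex set S: (edges leaving S) / min(vol(S), vol(V \ S)),
# where vol(T) is the sum of degrees in T. Low conductance = good local cluster.
graph = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},  # a tight triangle
    3: {2, 4}, 4: {3},                   # a sparse tail
}

def conductance(S):
    S = set(S)
    cut = sum(1 for u in S for v in graph[u] if v not in S)
    vol = lambda T: sum(len(graph[u]) for u in T)
    return cut / min(vol(S), vol(set(graph) - S))

print(conductance({0, 1, 2}))  # 1 cut edge, vol(S)=7, vol(rest)=3 -> 1/3
```

The triangle {0,1,2} has low conductance (a good local cluster), which is exactly the kind of structure that exists locally but, per the argument above, fails to aggregate into good large global partitions.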

Monday, July 25, 2011

GraphLab vs. Spark

I had a very interesting meeting today with Matei Zaharia, a graduate student from Berkeley with a very impressive publication record. He is one of the authors of Spark, a parallel cloud computing framework targeted at iterative and interactive algorithms, where Hadoop does not perform well.

I was impressed with the demo of Spark, which used 20 Amazon EC2 machines. Spark is implemented in Scala, which allows for a convenient, Java-like programming interface.

What caught my attention is that one of the algorithms implemented on top of Spark was alternating least squares. The work will be reported at the upcoming SIGCOMM:
M. Chowdhury, M. Zaharia, J. Ma, M.I. Jordan and I. Stoica, Managing Data Transfers in Computer Clusters with Orchestra.
Here are some performance numbers given by Matei:
4 cores (master + 0 workers): 272s
12 cores (master + 1 worker): 106s
20 cores (master + 2 workers): 72s
28 cores (master + 3 workers): 57s
36 cores (master + 4 workers): 50s
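From these timings one can back out the speedup and parallel efficiency relative to the 4-core baseline; a short Python calculation (using only the numbers above):

```python
# Speedup and parallel efficiency of the Spark ALS runs,
# relative to the 4-core (master-only) baseline.
timings = {4: 272, 12: 106, 20: 72, 28: 57, 36: 50}  # cores -> seconds
base_cores, base_time = 4, timings[4]
for cores in sorted(timings):
    speedup = base_time / timings[cores]
    efficiency = speedup / (cores / base_cores)
    print(cores, round(speedup, 2), round(efficiency, 2))
```

Efficiency drops from 1.0 to about 0.6 as workers are added (e.g. 2.57x speedup on 3x the cores), which matches Matei's conclusion below about communication cost lowering performance-per-CPU while still improving response time.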

Next, I tested multicore GraphLab using the same data and the same factorized matrix width (D=60). One iteration of alternating least squares in GraphLab takes 106 seconds on 8 cores of an AMD Opteron multicore machine. This is very close to Spark's result with 12 cores.
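For background, alternating least squares fixes one side's factor vectors and solves a least-squares problem for the other side, then alternates. Below is a rank-1 Python sketch on made-up ratings, purely for illustration (the experiments above use a factor width of D=60, and neither Spark's nor GraphLab's actual implementation is shown here):

```python
# Toy rank-1 alternating least squares (ALS) on made-up explicit ratings.
# ratings: dict mapping (user, item) -> observed rating.
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 1.0, (2, 1): 2.0}
n_users, n_items = 3, 2
u = [1.0] * n_users  # user factors
v = [1.0] * n_items  # item factors

for _ in range(20):
    # fix v, solve u_i = argmin sum_j (r_ij - u_i * v_j)^2 in closed form
    for i in range(n_users):
        num = sum(r * v[j] for (a, j), r in ratings.items() if a == i)
        den = sum(v[j] ** 2 for (a, j), r in ratings.items() if a == i)
        if den:
            u[i] = num / den
    # fix u, solve each v_j symmetrically
    for j in range(n_items):
        num = sum(r * u[i] for (i, b), r in ratings.items() if b == j)
        den = sum(u[i] ** 2 for (i, b), r in ratings.items() if b == j)
        if den:
            v[j] = num / den

rmse = (sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
        / len(ratings)) ** 0.5
```

With D > 1, each closed-form division becomes a small DxD linear solve per user/item, which is why the problem becomes computation-heavy (and scales better) as D grows.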

Overall conclusion (by Matei): I think the main takeaway from this is that you should absolutely use a lot of cores on one machine if you have them, as communication is much faster. When you add nodes, the communication cost will lower the overall performance-per-CPU, but you will get lower response times too.

Thursday, July 21, 2011

Fighting with Amazon EC2 AMI

There is no doubt that Amazon EC2 is one of the most successful and useful cloud services. However, a few days ago I had the nerve-wracking experience of trying to ec2-register an Amazon AMI image. This task is needed when you want to save your work so you can easily load it the next time you run. The Amazon AMI tools are among the worst-designed and worst-implemented tools I have ever encountered. You need a lot of patience when dealing with them. I wrote down some of the errors I encountered.
(I thought that having a PhD and 15 years of experience with Linux would make me immune to these kinds of errors, but I was absolutely wrong...) The reader should be warned that I did not collect these errors from the web; I simply encountered all of the errors below myself, until eventually I got so tired that I stopped documenting everything at a certain point.

Basically, what you want to do is run 3 commands. Normally this should take no more than a few minutes. However, if you manage to run these commands in less than a few hours, you are absolutely lucky.
These are the commands you want to run:
sudo -E /opt/aws/bin/ec2-bundle-vol -k [path to your x.509 private key] -c [path to your x.509 public key] -u [Amazon 12 digit user id] -d /mnt -p [bundle file name] -r x86_64
sudo -E /opt/aws/bin/ec2-upload-bundle -b [bundle file name] -m /mnt/[bundle file name].manifest.xml -s [Amazon AWS secret string] -a [Amazon AWS ID string]
sudo -E ec2-register -K [location of X.509 private key] -C [location of x.509 certificate] --name [bucket name]/[image name] --region us-east-1 [bucket name]/[image name].manifest.xml

Potential problems: if the process failed and you try again, you may get the error:
/opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:154:in `mount_image': image already mounted (FatalError)
    from /opt/aws/amitools/ec2/lib/ec2/platform/linux/image.rb:81:in `make'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:151:in `bundle_vol'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:193:in `main'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/tool_base.rb:201:in `run'
    from /opt/aws/amitools/ec2/lib/ec2/amitools/bundlevol.rb:201

Solution: using the "mount" command, find the mounted image and unmount it using "sudo umount XXX".

/opt/aws/bin/ec2-bundle-vol: line 3: EC2_HOME: Neither of EC2_AMITOOL_HOME or EC2_HOME environment variables are set
Assuming the AMI tools are installed, try using sudo -E. If that does not work, try setting (assuming a bash shell):
export EC2_AMITOOL_HOME=/home/ubuntu/ami-tools/ec2-ami-tools-1.3-57676/
where the path should point to where the AMI tools are installed. If they are not installed,
you need to install the EC2 AMI tools first, and then set the environment variable using the "setenv" or "export" command.

--user has invalid value 'AKIAJWASWE2DSWQFKILA': the user ID should consist of 12 digits (optionally hyphenated); this should not be your Access Key ID
Try 'ec2-bundle-vol --help'

You gave the wrong key; look for a numeric user ID on the Amazon AWS website, in the format 0000-0000-0000. This ID is especially hard to find within all the menus.

ERROR: the specified image file /mnt/ already exists

Solution: remove the image file created using the command
sudo rm -fR /mnt/

The specified bucket is not S3 v2 safe (see S3 documentation for details):
Solution: looks like an EC2 bug - underscores and capital letters are allowed but result in this warning. If you ignore this warning, you will get much worse errors later. Try to avoid it.

mke2fs 1.41.12 (17-May-2010)
error writing /etc/mtab.tmp: No space left on device

You tried so many times that you ran out of disk space. You need to clean up files, or restart the image and retry.

Neither a 'manifest' or 'block-device-mapping' have been specified; at least one is required. (-h for usage)

You should use both the -n flag to specify a bucket name and the path bucketname/imagename.manifest.xml. By the way, the bucket name is flexible - it does not have to match the image name.

1) Client.InvalidManifest: HTTP 403 (Forbidden) response for URL check your S3 ACLs are correct.
2) Client.InvalidManifest: HTTP 404 (Not Found) response for URL check your manifest path is correct and in the correct region.

Something in the process has gone wrong - either the bucket name is wrong or the upload failed. You need to redo everything correctly from the beginning.

ERROR: Parameter problem: Expecting S3 URI with just the bucket name set instead of 'graphlab_org_release_v1234'
Solution: you need to add the s3:// prefix when using the command s3cmd mb.

ERROR: Error talking to S3: Curl.Error(51): SSL: certificate subject name '*' does not match target host name ''.
Solution: no clue what I did - I started to lose focus at this point. I probably started all over again... :-(

Problem: Client.InvalidAMIName.Duplicate: AMI name graphlaborgreleasev1234 is already in use by AMI ami-98946ef1
Solution: this happens when you try to register a new AMI with a name you already gave to an older AMI; you need to rename it.

Hopefully, after all this mess, you manage to ec2-register... and get a printout of
the type:
IMAGE AMI-12120930
And I ask: why not simply add a UI option to the AWS console for registering an image???

Final comment: I had a quick email exchange with James Hamilton, a VP at Amazon, and I sent him this link. I got back the following note: "Sorry you had a bad experience with EC2."
I would like to take this opportunity to clarify that my overall experience with EC2 is very good. But still, some interfaces could be improved.

Monday, July 18, 2011

Hearst Machine Learning Challenge - Converting inputs to SVMLight format

After the excitement following our 5th place in KDD CUP 2011 died down a little, I started looking at other interesting problems. The Hearst machine learning challenge has some interesting data: about 1M emails are given, with 273 sparse features. The task is to classify some validation emails and decide whether the user opened the email and whether he clicked on the link within it. The problem is not so easy, since the data is highly skewed - most users ignore ad emails
as spam, so the number of positive examples is rather low.

One of the classic ways of solving the classification problem is using an SVM (support vector machine). SVMLight is a popular implementation of an SVM solver.
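For reference, SVMLight expects one example per line: a target label followed by index:value pairs for the nonzero features only, with 1-based feature indices. A tiny Python formatter for such lines (a hypothetical helper for illustration, not part of SVMLight):

```python
# Format one example as an SVMLight line: "<target> <index>:<value> ...".
# features: dict mapping 1-based feature index -> nonzero value.
def to_svmlight(target, features):
    parts = [str(target)]
    parts += ["%d:%g" % (i, v) for i, v in sorted(features.items())]
    return " ".join(parts)

print(to_svmlight(-1, {3: 15.4, 4: 18, 19: 32}))  # -> -1 3:15.4 4:18 19:32
```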

Here is a short script I wrote for converting Hearst machine learning challenge data into SVMLight format (and also pegasos format).

%function for converting hearst data to svm light format
%Input: number - the file number. 1-5 Model files. 6 - validation.
%       doclick or doopen - one of them should be 1 and the other zero,
%       depending on which target is used.
%Written by Danny Bickson, CMU, July 2011.
%This script converts hearst machine learning challenge data into SVMlight
%format, namely: <target> <feature>:<value> <feature>:<value> ...
%for example:
%-1 3:15.4 4:18 19:32

function []=convert2svm(number, doclick, doopen)

assert(number>=1 && number<=6);
row_offset = [0 400000 800000 1200000 1600000 0]; % row offsets of the 6 files
rows = [400000 400000 400000 400000 185421 9956];
cols = 274; % 273 features, plus the date field that splits off field 273

assert(~(doopen && doclick));
assert(doclick || doopen);

% dictionary mapping day-of-week strings (field 273) to numbers
terms273 = {'Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'};
ids      = num2cell(1:length(terms273));
dict273  = reshape({terms273{:}; ids{:}}, 2, []);
dict273  = struct(dict273{:});

fid = fopen(['Modeling_', num2str(number), '.csv'], 'r'); % file 6 is the validation set
if (doclick)
  outid = fopen(['svm', num2str(number), '.txt'], 'w');
else
  outid = fopen(['2svm', num2str(number), '.txt'], 'w');
end

title = textscan(fid, '%s', 273, 'delimiter', ','); % read title
title{274} = 'date'; % field no. 273 is mistakenly parsed into two fields in matlab because of a ","

% go over rows
for j=1:rows(number)-1
  if (mod(j,500) == 0)
      disp(['row ', num2str(j)]);
  end
  a = textscan(fid, '%s', 274, 'delimiter', ',');
  a = a{1};
  for i=1:cols
      val = a{i};
      if (i == 1 || i == 2) % handle target
        if ((doclick && i==1) || (doopen && i==2))
           if (number == 6)
             fprintf(outid, '%d ', -1); % target is unknown, write -1 as a placeholder
           else
             fprintf(outid, '%d ', (2*strcmp(val, 'Y'))-1);
           end
        end
      elseif (~strcmp(val, '')) % if feature is non zero
          if (i == 73) % translate a field of the type A01, B03, J05, etc. quickly into a number
              val = val(1)*26 + val(3);
          elseif (i == 273)
              val = val(2:end); % remove quotation mark
              val = dict273.(val);
          elseif (i == 274) % translate date into a number
              val = datenum(a{274});
          else
              if (length(val) == 1)
                val = uint8(val);
              elseif (sum(isletter(val)) == 0) % string is all digits, translate to double
                val = str2double(val);
              else
                val = sum(uint8(val)); % translate a string into a number using the sum of its chars; fancier methods can be used here
              end
          end
          fprintf(outid, '%d:%f ', i-2, val); % subtract two from the field number since the first two fields are targets
      end
  end
  fprintf(outid, '\n');
end

fclose(fid);
fclose(outid);

The script can actually be run in parallel on a multicore machine (optimally if you have 11 cores). The way to run it is to execute the following in a Linux shell:
for i in `seq 1 1 6`; do
  matlab -r "convert2svm($i,1,0)" &
  matlab -r "convert2svm($i,0,1)" &
done
The resulting files are svm1.txt -> svm5.txt (using the first target - open email), 2svm1.txt -> 2svm5.txt (using the second target - click email), and the validation.txt file. Next you can merge the files using the commands:
cat svm1.txt > total.txt
for i in `seq 2 1 5`; do
  cat svm$i.txt >> total.txt
done

GraphLab - Machine Learning in the Cloud

A few days ago we released a detailed technical report about the performance of distributed GraphLab on Amazon EC2, with up to 64 nodes (512 cores total):

We compared GraphLab using three applications: matrix factorization, CoEM (a named entity recognition algorithm that is a variant of personalized PageRank), and video co-segmentation.

As a reference, we compared three platforms: Hadoop, MPI (message passing interface), and GraphLab. In a nutshell, GraphLab runs about 20x to 100x faster than Hadoop, depending on the data and the application. The main reason is that we perform all computation in memory and do not provide any fault tolerance. Compared to MPI, GraphLab has similar performance. The drawback of MPI is that the code has to be rewritten for each application, while GraphLab provides building blocks for iterative computation.

The following graph shows the speedup of the 3 applications using 64 Amazon HPC machines:

The baseline for the speedup calculation is 4 machines. For matrix factorization (the line denoted Netflix) we get a speedup of x16 on 64 machines. For video co-segmentation we get a speedup of x40 on 64 EC2 nodes. When we increase the factorized matrix width, the matrix factorization problem becomes computation-heavy and we get an even better speedup of x40 on 64 nodes.