Saturday, October 22, 2011

Speeding up SVD on a mega matrix using GraphLab - part 2

It seems I struck a nerve, since I am getting a lot of comments about this blog post.

First, here is an email I got from Bryan, a Research Programmer at CMU:
"The machine in question, named "thrust", features two Xeon X5690 CPUs. Each one has 12 hyperthreaded cores, clock speed 3.47GHz, 64 KB L1 cache/core (32 KB L1 Data + 32 KB L1 Instruction) and 256 KB L2 cache/core, and 12MB L3 cache. It's got 192GB of RAM. We're running Ubuntu 10.04 LTS with the Ubuntu-supplied underlying libraries (boost, ITPP, etc.), some of which are a bit older than the ones supplied to the newer version of Ubuntu mentioned on your blog.


No problem on getting you access to this machine. The only difficulty would be in coordinating who gets to use it when. We have other somewhat less well-endowed machines of identical hardware platform and software setup that see more idle time, where it'd be easier to fit in runs that take only a few tens of GB of RAM and, say, half a dozen cores.


To elaborate on what I observed during Abhay's run, I present the following strange wall of numbers excerpted from a home-grown load monitoring script:


Fri-Oct-21-03:31:23 1319182283 13.2 3.0 0.0 7.6 0.1 0.6/ 13.2 0.0/12.4
Fri-Oct-21-03:31:28 1319182288 5.711.0 0.0 9.0 0.0 1.3/ 29.6 0.0/ 0.0
Fri-Oct-21-03:31:33 1319182293 6.7 1.7 0.016.5 0.0 1.3/ 30.0 0.0/ 0.0
Fri-Oct-21-03:31:38 1319182298 10.4 2.1 0.012.4 0.0 1.3/ 30.5 0.0/ 0.0
Fri-Oct-21-03:31:43 1319182303 6.8 8.8 0.0 9.1 0.0 1.8/ 31.2 0.0/14.0
Fri-Oct-21-03:31:48 1319182308 4.5 2.1 0.018.4 0.0 1.3/ 31.2 0.0/ 0.0
Fri-Oct-21-03:31:53 1319182313 1.9 1.2 0.021.6 0.0 1.3/ 30.1 0.0/ 0.0
Fri-Oct-21-03:31:58 1319182318 13.2 2.4 0.0 9.1 0.0 1.3/ 31.6 0.0/ 0.0
Fri-Oct-21-03:32:03 1319182323 5.211.8 0.0 7.7 0.0 1.4/ 36.5 0.0/ 0.0
Fri-Oct-21-03:32:08 1319182328 7.5 1.8 0.015.7 0.0 1.5/ 37.7 0.0/ 0.0
Fri-Oct-21-03:32:13 1319182333 9.7 1.6 0.013.8 0.0 2.2/ 33.0 0.0/ 0.0
Fri-Oct-21-03:32:18 1319182338 7.9 8.1 0.0 8.8 0.0 1.4/ 34.4 0.0/14.0
Fri-Oct-21-03:32:23 1319182343 4.5 2.1 0.018.3 0.0 1.5/ 37.3 0.0/ 0.0
Fri-Oct-21-03:32:28 1319182348 2.0 1.7 0.021.0 0.0 1.5/ 40.7 0.0/ 0.0
Fri-Oct-21-03:32:33 1319182353 13.2 2.1 0.0 9.7 0.0 1.6/ 42.3 0.0/ 0.0
Fri-Oct-21-03:32:38 1319182358 5.212.7 0.0 6.6 0.0 1.6/ 43.8 0.0/ 0.0
Fri-Oct-21-03:32:43 1319182363 8.1 2.1 0.014.7 0.0 6.7/ 24.8 0.0/ 0.0
Fri-Oct-21-03:32:48 1319182368 7.9 1.4 0.015.6 0.0 1.3/ 28.1 0.0/ 0.0
Fri-Oct-21-03:32:53 1319182373 8.8 8.9 0.0 7.1 0.0 1.3/ 28.3 0.0/14.0
Fri-Oct-21-03:32:58 1319182378 4.7 2.2 0.018.1 0.0 1.3/ 28.3 0.0/ 0.0
Fri-Oct-21-03:33:03 1319182383 2.4 1.2 0.021.2 0.0 1.3/ 28.0 0.0/ 0.0
Fri-Oct-21-03:33:08 1319182388 11.6 2.1 0.011.3 0.0 1.3/ 25.8 0.0/ 0.0
Fri-Oct-21-03:33:13 1319182393 6.212.5 0.0 5.7 0.0 2.0/ 27.9 0.0/ 0.0
Fri-Oct-21-03:33:18 1319182398 8.9 2.0 0.014.0 0.0 1.5/ 32.7 0.0/ 0.0
Fri-Oct-21-03:33:23 1319182403 6.2 1.8 0.016.8 0.0 1.3/ 30.6 0.0/ 0.0
Fri-Oct-21-03:33:28 1319182408 10.4 7.9 0.0 6.4 0.0 1.4/ 29.8 0.0/14.0
Fri-Oct-21-03:33:33 1319182413 4.6 2.1 0.018.4 0.0 1.4/ 29.8 0.0/ 0.0
Fri-Oct-21-03:33:38 1319182418 2.5 1.8 0.020.7 0.0 1.2/ 27.1 0.0/ 0.0
Fri-Oct-21-03:33:43 1319182423 10.1 1.8 0.013.1 0.0 1.9/ 23.8 0.0/ 0.0
Fri-Oct-21-03:33:48 1319182428 6.912.0 0.0 5.6 0.0 1.3/ 28.4 0.0/ 0.0
Fri-Oct-21-03:33:53 1319182433 8.9 1.7 0.014.3 0.0 1.5/ 35.6 0.0/ 0.0
Fri-Oct-21-03:33:58 1319182438 6.0 1.8 0.017.0 0.0 1.4/ 32.2 0.0/ 0.0
Fri-Oct-21-03:34:03 1319182443 10.3 8.7 0.0 5.6 0.0 1.4/ 35.1 0.0/14.0
Fri-Oct-21-03:34:09 1319182449 3.8 1.8 0.015.2 0.0 3.7/ 27.9 0.0/ 0.0
Fri-Oct-21-03:34:14 1319182454 2.9 1.8 0.020.1 0.0 1.2/ 26.9 0.0/ 0.0
Fri-Oct-21-03:34:19 1319182459 8.7 2.1 0.014.2 0.0 1.3/ 25.0 0.0/ 0.0


Each row represents statistics averaged over the last five seconds. The first two columns are time stamps. The interesting bit is the next grouping of five columns of poorly-spaced numbers. The first number is the average number of cores doing something in the kernel. We see here that the process has spurts where it will spend ~10 cores in the kernel for ~five seconds, and then that number will back off for 15-30 seconds. The next column counts idle time, and we see a spike in idle time after each spike of kernel time, of similar or perhaps greater magnitude. The 3rd column is all zeros. The 4th column is the # of cores spent in userland, and we again see a pulsation between near-full load and spurts of 5-10 seconds where it shifts first into the kernel and then goes idle. The 5-second interval apparently is too coarse to catch all of the kernel/idle spikes. The 5th column is the # of cores blocked on I/O, so we can see that there is no measurable waiting for disk.


The other interesting part is the rightmost column. This is the average MB/s read and written to disk. What we see here is that the spikes in kernel time correlate with bursts of writes to disk that amount to 14MB/s within that 5-second interval (i.e. a 70MB write -- sounds like one of those swap files).


Now, I should mention that the disk here is a Linux software RAID6 array, and that the RAID calculations show up as kernel time. Obvious culprit, but seeing even one whole core's worth of CPU time on RAID calculations is a bit rare, and I don't think I've ever seen more than two. We could settle this by running top and looking at the relevant system processes. But I would have to guess that something about the way in which this data is being written out, or perhaps some memory-related operations, is causing the kernel so much fuss. It may be that your code is not really "at fault" per se, but, as a programmer, I look at all that idle time and wonder if there isn't a way to put it to use."

And here is my reply:
"Very interesting question and observations!
Currently, my code is implemented as follows. On each iteration, the multiplication by A' is done in parallel across threads. Then all the threads are collected, and the multiplication by A is done in parallel. After that, some accounting is done to compute the eigenvector of this iteration, followed by writing it to disk.


I guess what you see in your monitoring are the idle periods, where a single thread is active doing the accounting and writing the intermediate eigenvectors to disk.


I am sure that further optimizations to the code are possible, to utilize the cores better and potentially distribute some of the accounting. Having access to your machine will help me optimize further. I will look at this some more on my machines, but it will be most effective to look at it on yours. We can start with smaller problems in order not to overload your machine."
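
To make the iteration structure described in my reply more concrete, here is a minimal sketch in Python/numpy (not the actual GraphLab C++ implementation; the function and file names are mine): the two matrix products are the parallelizable phases, while the accounting and the write of the intermediate vector to disk happen on a single thread, which would explain the idle spurts and the periodic writes that Bryan observed.

# A rough sketch, for illustration only -- not the actual GraphLab implementation.
import numpy as np

def one_iteration(A, u, out_path):
    # First parallel phase: multiply by A' (rows can be split across worker threads)
    x = A.T.dot(u)
    # All threads are collected here (barrier).
    # Second parallel phase: multiply by A (again splittable across threads)
    w = A.dot(x)
    # Serial "accounting": normalization / bookkeeping for this iteration's vector
    sigma = np.linalg.norm(w)
    u_next = w / sigma
    # Serial I/O: a single thread writes the intermediate vector to disk
    np.savetxt(out_path, u_next)
    return u_next, sigma

One direction for putting the idle time to use, as Bryan hints, would be to overlap the serial write with the start of the next iteration, for example by handing the vector off to a background writer thread.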

Bryan and Abhay were kind enough to grant me access to their thrust machine, so now I am going to try it out myself!

Next, I got the following email from Ehtsham Elahi:
Hi Danny,


I hope you'll be doing great. I wanted to know about the GraphLab SVD solver that you have mentioned in your blog. I would like to try it out on a matrix of about 5 million rows and 400 columns and would like to compare the performance gains over Mahout. It would be great if you could guide me to useful resources and the input formatting code you have mentioned on your blog. I also want to read the best documentation available for the Java API to GraphLab and would really appreciate it if you could direct me to it.


Best,
Ehtsham Elahi

This is very cool! Before the ink has dried on my blog post (or whatever..), users actually want to try out my code. If you want to try the code, these are the steps you need to follow:

1) Download and install GraphLab on your system, using a checkout from our mercurial repository as instructed here: http://graphlab.org/download.html . If you have access to an Ubuntu machine or an EC2 machine, that would be the easiest way to start.
2) Prepare your matrix in MatrixMarket sparse matrix format as explained here: http://graphlab.org/pmf.html (a minimal example file is shown after this list).
Alternatively, if you already have a Mahout sequence file, you can translate it to GraphLab collaborative filtering format
using the instructions here: http://bickson.blogspot.com/2011/08/graphlab-and-mahout-compatability.html
3) Run our new SVD solver using the command ./demoapps/clustering/glcluster 7 2 0 --matrixmarket=true --scope=null --ncpus=XX
4) When you run Mahout, note that their current Lanczos solver solves only one side of the problem (it gives you the singular vectors of one side only), so you will need to run their solver twice: once for A and once for A'. Our solver runs both iterations in a single pass.
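
For step 2, here is what a tiny MatrixMarket sparse matrix file looks like. The dimensions and values below are made up just to show the layout; the header line, the "rows columns non-zeros" line, and the 1-based "row column value" triplets are the standard MatrixMarket coordinate format:

%%MatrixMarket matrix coordinate real general
% a toy 4x3 matrix with 5 non-zero entries
4 3 5
1 1 1.5
1 3 2.0
2 2 0.5
3 1 3.0
4 3 1.0

The format itself is explained in more detail on the pmf.html page linked from step 2.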

Anyway, I would love to get more comments on this from anyone out there! Email me..
