Large Scale Machine Learning and Other Animals: June 2013

Wednesday, June 26, 2013

Shark @ SIGMOD workshop

Got this from my collaborator Joey Gonzalez, who attended Reynold's Shark talk at the SIGMOD GRADES workshop.

Shark is a SQL engine on top of Spark. It offers an interesting marriage of analytics with SQL. Here is an example of logistic regression:
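The general pattern of feeding SQL results straight into a learning algorithm can be sketched in plain Python. This is only an illustration of the idea: it is not Shark's API and not the example from the talk. The `users` table, its columns, the rows, and the helper names are all made up, and sqlite3 stands in for the SQL engine.

```python
import math
import sqlite3


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, lr=1.0, epochs=2000):
    """Batch gradient descent on (x1, x2, label) rows."""
    w1, w2, b = 0.0, 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for x1, x2, y in rows:
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y  # prediction error
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b


# Hypothetical table and column names, with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (age REAL, visits REAL, clicked INTEGER, country TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    [
        (0.2, 0.1, 0, "US"),
        (0.3, 0.2, 0, "US"),
        (0.8, 0.9, 1, "US"),
        (0.7, 0.8, 1, "US"),
        (0.5, 0.5, 1, "DE"),  # filtered out by the WHERE clause
    ],
)

# SQL does the selection; the learner consumes the result directly.
rows = conn.execute(
    "SELECT age, visits, clicked FROM users WHERE country = 'US'"
).fetchall()
w1, w2, b = train_logistic(rows)


def predict(x1, x2):
    """Probability of a click under the fitted model."""
    return sigmoid(w1 * x1 + w2 * x2 + b)
```

The point of the sketch is the hand-off: the query result is used as the training set without an export step in between.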

And some performance numbers:

Detailed information about the experimental setup can be found on their results page. Thanks to my collaborator Chris DuBois for sharing!

Do you want to meet Reynold Xin and ask him about Shark? Well, as you probably guessed, Reynold will attend our GraphLab workshop on Monday, July 1st in SF, presenting GraphX, a graph processing system on top of Spark.

Sunday, June 23, 2013

GraphLab Image Processing Toolkit - Image Stitching (CloudCV)

We got some exciting news from Dhruv Batra from Virginia Tech:

Dear Graphlab team,

As most of you know, I was working on the Graphlab computer vision toolbox last summer. The motivation behind it was to provide distributed implementations of computer vision algorithms as a service.

In that spirit, I am happy to announce that my students and I have produced a first version of CloudCV.

-- In the first version, the only implemented algorithm is image stitching

-- The front-end allows you to upload a collection of images, which will be stitched to create a panorama.

-- The back-end is a server in my lab running our local repository of graphlab

-- We are currently running stitching in shared-memory parallel mode with ncpus = 3.

-- The 'terminal' in the webpage will show you familiar looking messages from graphlab.

Cheers,

Dhruv

And here are some images you can use if you want to try the image stitching software:

Are you interested in learning more about CloudCV? Join our July 1st SF workshop, where Harsh Agrawal from Virginia Tech will present a demo of CloudCV.

Wednesday, June 19, 2013

Last chance registration to the 2nd GraphLab Workshop

We are seeing great demand for this year's 2nd GraphLab workshop (Monday July 1st in SF): already ~~378~~ ~~383~~ 467 registrations and growing quickly. Please register ASAP here: http://glw2.eventbrite.com before we are sold out!

Some updates about the agenda. We are constantly working to include the most interesting projects in the area of big data analytics, graph processing and graph databases. Some new content that was added in the last couple of weeks:

New demos:

Zach Solan from Coderbot, one of the most exciting stealth startups, will demo their automatic machine learning robot, which automatically selects features and algorithms and tunes parameters.
Francisco Martin, Poul Petersen and Adam Ashenfelter will demo BigML.
Jason Riedy from Georgia Tech will demo STINGER, a streaming graph analytics package from Georgia Tech.
MapR demo - to be announced soon

New posters:

Eriko Nurvitadhi will present a poster of GraphGen
Paul Hoffman from SaffronTech will present a poster about threat prediction for the Gates foundation
Eiko Yoneki (University of Cambridge); Amitabha Roy (EPFL) - Scale-up Graph Processing: A Storage-centric View

New talks:

Additional oral talk by S V N Vishwanathan, Purdue - NOMAD: Non-locking stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization

New sponsors: we would like to thank our new sponsors:

Pandora Internet Radio
Quid.com
Reservoir Labs
Neo4j
LexisNexis will sponsor lunch!
Intel Labs just joined as our Platinum sponsor!!

Our full list of sponsors

Friday, June 7, 2013

Stinger project

My colleague Ben Lorica from O'Reilly Media sent me a link to the Stinger project. It is an interesting streaming graph processing framework from Georgia Tech. If you would like to learn more about this project, you should attend our 2nd GraphLab workshop, as I have invited Dr. Jason Riedy to present Stinger there.

Here are the design objectives of Stinger (taken from their website):

Portability: Algorithms written for STINGER can easily be translated/ported between multiple languages and frameworks
Productivity: STINGER should provide a common abstract data structure such that the large graph community can quickly leverage each others' research developments. This is similar in philosophy to the numerical algorithms community implicit use of sparse and dense matrices.
Performance: It is recognized that no single data structure is optimal for every graph algorithm. The objective of STINGER is to configure a sensible data structure that can run most algorithms well

Sunday, June 2, 2013

Notable presentations at Technion TCE conference 2013: RevMiner & Boom

I found two interesting talks at the TCE conference.

The first was by Prof. Oren Etzioni from the University of Washington, on the RevMiner system. RevMiner mines the text of Yelp! reviews of Seattle restaurants. The results were published in a UIST paper in 2012: http://turing.cs.washington.edu/papers/uist12-huang.pdf

In a nutshell, they look for tokens and their relations in the text, and you can ask questions like "good sushi @ seattle" and find a recommendation based on user reviews. They are trying to compete in the Yelp! business prediction contest I wrote about here. I wonder how their system will perform.
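To make the query idea concrete, here is a toy sketch of my own. It is not RevMiner's extraction method: it simply counts adjacent word pairs per restaurant and ranks restaurants by how often the queried pair appears in their reviews. The restaurant names, review texts, and function names are all hypothetical, and the location part of the query is ignored.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens; punctuation such as '@' is dropped."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(reviews):
    """Map each adjacent word pair to a per-restaurant count."""
    index = defaultdict(Counter)
    for restaurant, text in reviews:
        tokens = tokenize(text)
        for pair in zip(tokens, tokens[1:]):
            index[pair][restaurant] += 1
    return index


def search(index, query):
    """Rank restaurants by mentions of the first two query words."""
    pair = tuple(tokenize(query)[:2])
    counts = index.get(pair, Counter())
    return [name for name, _ in counts.most_common()]


# Made-up reviews for illustration only.
reviews = [
    ("Sushi Place", "Good sushi and great service. Really good sushi!"),
    ("Burger Joint", "Great burgers, good sushi was not expected"),
    ("Corner Cafe", "bad sushi"),
]
index = build_index(reviews)
results = search(index, "good sushi @ seattle")
```

A real system would need proper parsing of attribute-value relations and sentiment; adjacency counting is only the crudest stand-in.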

They also have a nice UI: a color bar that summarizes the level of the different ratings. For each recommendation, you can see how it was derived by browsing the original reviews.

Another related paper, on text mining Twitter data: http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

The second interesting talk, by Yoram Singer from Google, is about a system called Boom. The context is classification performed on ads, to decide which ad the user will click. The binary features of the ad are from a very high dimensional space, but are very sparse.

Boom uses a very simple parallel coordinate descent to optimize a cost function that is a sum of convex functions. The main trick is speeding up Nesterov's accelerated gradient method by exploiting the fact that the data is highly skewed.
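For readers unfamiliar with coordinate descent on sparse binary features, here is a minimal sketch under my own assumptions. It is not Boom's algorithm: it runs sequentially rather than in parallel, uses no Nesterov acceleration, and picks L2-regularized logistic loss as the sum of convex functions. All names and the tiny dataset are hypothetical. Each coordinate takes a step scaled by a curvature bound (at most 1/4 per example that has the feature), and an inverted index means only the examples containing a feature are touched.

```python
import math
from collections import defaultdict


def objective(w, examples, labels, lam):
    """Sum of logistic losses plus an L2 penalty."""
    loss = 0.5 * lam * sum(v * v for v in w)
    for active, y in zip(examples, labels):
        margin = sum(w[j] for j in active)
        loss += math.log(1.0 + math.exp(-y * margin))
    return loss


def coordinate_descent(examples, labels, num_features, lam=0.1, sweeps=50):
    """examples: sets of active feature ids; labels: +1 or -1."""
    by_feature = defaultdict(list)  # feature id -> examples that contain it
    for i, active in enumerate(examples):
        for j in active:
            by_feature[j].append(i)

    w = [0.0] * num_features
    margins = [0.0] * len(examples)  # cached w . x_i for each example
    for _ in range(sweeps):
        for j in range(num_features):
            rows = by_feature.get(j, [])
            if not rows:
                continue
            # Partial derivative of the objective with respect to w[j].
            g = lam * w[j]
            for i in rows:
                g -= labels[i] / (1.0 + math.exp(labels[i] * margins[i]))
            # Step scaled by an upper bound on the coordinate's curvature.
            step = g / (len(rows) / 4.0 + lam)
            w[j] -= step
            for i in rows:  # only examples with feature j change
                margins[i] -= step
    return w


# Made-up data: features 0 and 1 appear in positives, 2 in negatives.
examples = [{0, 1}, {0}, {2, 3}, {2}, {1, 3}]
labels = [1, 1, -1, -1, 1]
w = coordinate_descent(examples, labels, num_features=4)
```

Because each update only visits the examples that contain the feature, rare features are cheap to update, which is one way sparsity and skew in the data can be exploited.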

An update: just got a note from my reader Patrick Durusau, with a link to the Boom video lecture. Thanks Patrick!
