Large Scale Machine Learning and Other Animals: January 2014

Tuesday, January 28, 2014

Vote for GraphLab for best machine learning startup - Gigaom structure data awards

If you like GraphLab, please spend 2 seconds of your time by clicking on the Gigaom Survey. Your help is much appreciated!!

Weird dataset: identifying sexual predators in chat rooms

To all of the bored data scientists who are looking for an interesting demo (or, alternatively, to all the startups who want to do a fraud detection demo), I stumbled upon this weird dataset, which was part of the PAN 2012 conference: identifying sexual predators in chat rooms.

A less bizarre dataset is the beer classification dataset reported in William M. Briggs' blog. It is the classic geeky cool dataset, since it shows you are a data scientist who likes beer.

Another borderline dataset is the "how clean are SF restaurants" dataset I wrote about before.

Sunday, January 26, 2014

Yeppp! math library

I got this from my colleague Chris DuBois: Yeppp! is a new math library from Georgia Tech. According to their benchmark page, they have impressive performance vs. other packages, including Intel MKL. Yeppp! supports C, C++, Java and Fortran interfaces. It is licensed using the Creative Commons 3 license.

Friday, January 24, 2014

Parquet: efficient column store on Hadoop

I got this from my collaborator Joey Gonzalez. Cloudera is backing Apache Parquet, an efficient column store on top of Hadoop, which is an open source version of Google Dremel.

If you would like to hear more about Cloudera's vision of new trends in data science, you should attend our 3rd GraphLab Conference to hear Josh Wills, director of data science @ Cloudera.

Wednesday, January 22, 2014

Big data faculty positions at Emory University

I got this from Eugene Agichtein, a Prof. at Emory University:

We are starting a 6-faculty "big data" faculty search, with machine learning and systems being key focus areas. The position ad is here: http://www.mathcs.emory.edu/uploaded-files/Emory-DS-Ad.pdf. We are starting to review candidates soon.

Emory U is in Atlanta, GA, and has been consistently ranked in the top 20 of U.S. universities. While it has traditionally focused on life sciences research, it is now seriously expanding computational-* programs across all schools. The CS PhD program was only started in 2007, but has managed to attract pretty good students.

Tuesday, January 21, 2014

TunkRank on GraphLab

Just learned that students from EPFL have implemented an algorithm called TunkRank on top of GraphLab. An explanation of the algorithm is available here.

The advising researcher is Prof. Mike Ferdman, from Stony Brook University. This work was done as part of the CloudSuite project.

HP Titan - a notable presentation by Dr. Ira Cohen, HP Software

A great and impressive talk by Ira Cohen, CTO of HP Software, at our applied ML meetup yesterday. HP came to the conclusion that they cannot hire enough data scientists. So they set up an operation where smart programmers are trained to use data science tools. First the programmers undergo a 5-day applied ML course. Then they are supplied with Titan, which is basically an ML toolset for dummies. Titan has 4 conceptual steps:

1) Data import - connects to data sources like Twitter, Salesforce, web, databases etc.

2) Data filter - automatic data filtering and normalization; the user selects the interesting target to predict.

3) Data analytics - the system automatically suggests which ML methods to use. The user just clicks the ones that fit. (Very high level - like classify, regress etc.).

Once a topic is selected, a few algorithms are run in parallel and the results are shown to the user.

4) Publish - once the user is happy with the results, he can publish them in one of several visual forms like graphs, geographical maps etc. The publish step creates either an interactive web page or a PDF with the results.

The results are very impressive. Around 70 programmers took the ML training. In 4 months they created around 30 projects, many of which are being pushed towards deployment in production.

One case study he gave is customer leads prediction. You simply select the data source (Salesforce), the target (sell / no sell) and the ML method (classify), and after a few minutes you get an interactive application with zoomable US maps that show you sales predictions. Everything is highly visual and appealing.
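The "run a few algorithms in parallel, then show the results" idea from step 3 can be sketched in a few lines of Python. To be clear, this is not Titan's code or API, which I have not seen: the toy data, the three candidate "algorithms" and all function names are hypothetical and only illustrate the idea.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy labeled data as (feature, label) pairs. Purely illustrative, not HP data.
TRAIN = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
TEST = [(0.2, 0), (0.35, 0), (0.7, 1), (0.95, 1)]


def majority_class(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def threshold(train):
    """Predict 1 above the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)


def nearest_neighbor(train):
    """Predict the label of the closest training point."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def evaluate(name, fit):
    """Train one candidate and return (name, accuracy on TEST)."""
    model = fit(TRAIN)
    correct = sum(model(x) == y for x, y in TEST)
    return name, correct / len(TEST)


def run_all(candidates):
    """Evaluate every candidate in its own thread, best accuracy first."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(evaluate, name, fit)
                   for name, fit in candidates.items()]
        scores = [future.result() for future in futures]
    # Sort by accuracy (descending), then by name to break ties.
    return sorted(scores, key=lambda s: (-s[1], s[0]))


CANDIDATES = {
    "majority_class": majority_class,
    "threshold": threshold,
    "nearest_neighbor": nearest_neighbor,
}

results = run_all(CANDIDATES)
for name, accuracy in results:
    print(f"{name}: {accuracy:.2f}")
```

The user never picks an algorithm up front: every candidate is tried, and only the ranked list is shown. A real system would of course use proper learners and cross validation instead of a single tiny test set.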

Benchmarking GraphLab vs. Hadoop

We are often asked to cite external evaluations that benchmark GraphLab vs. other frameworks. Here is a collection of some resources we are aware of. We would love to get additional ones.

A relatively new poster presentation (appeared in SC 13) is:
Towards Benchmarking Graph-Processing Platforms. by Yong Guo (Delft University of Technology), Marcin Biczak (Delft University of Technology), Ana Lucia Varbanescu (University of Amsterdam), Alexandru Iosup (Delft University of Technology), Claudio Martella (VU University Amsterdam), Theodore L. Willke (Intel Corporation), in Super Computing 13 pdf

The above figure from this paper shows that GraphLab is the fastest of all the compared systems on the BFS task (pink line).
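For readers who have not met it, BFS (breadth-first search) visits the vertices of a graph in order of their distance from a source vertex. Here is a minimal single-machine sketch in Python. It only shows what the task computes: the toy graph and the function name are my own, and the platforms in the paper run distributed implementations that look nothing like this.

```python
from collections import deque


def bfs_levels(graph, source):
    """Return {vertex: hop distance from source} for all reachable vertices."""
    levels = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in levels:  # first visit gives the shortest distance
                levels[neighbor] = levels[node] + 1
                queue.append(neighbor)
    return levels


# Toy directed graph as an adjacency list (hypothetical example data).
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
    "f": ["a"],  # "f" points to "a" but is not reachable from "a"
}

levels = bfs_levels(GRAPH, "a")
print(levels)
```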

I recently got from Matt Grober (Walmart) a newer paper from the same group: "How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis" Y. Guo, M. Biczak, A. Varbanescu, A. Iosup, C. Martella, T. Willke IEEE International Parallel and Distributed Processing Symposium IPDPS 2014. pdf

Other resources:

Recent academic paper which benchmarks GraphLab vs. Mahout: http://bickson.blogspot.com/2013/07/benchmarking-of-machine-learning.html
Facebook report where they admit they did an evaluation of GraphLab (but no real performance numbers are reported): https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Amazon CTO writes about GraphLab in his blog: http://bickson.blogspot.co.il/2013/08/amazons-cto-writes-on-graphlab-in-his.html
Intel Labs report on GraphLab vs. Mahout: http://bickson.blogspot.com/2013/03/intel-labs-report-on-graphlab-vs-mahout.html

Monday, January 20, 2014

GraphLab tutorial @ the iSocial Summer School in Stockholm June 2-4, 2014

To all of my Nordic readers, I was invited to give a GraphLab lecture at the iSocial summer school in Stockholm, June 2-4, 2014. Will be happy to connect with anyone interested.

Wednesday, January 15, 2014

Petuum - a new distributed machine learning framework from CMU (Eric Xing)

From their website:

Petuum is a distributed machine learning framework. It takes care of the difficult system "plumbing work", allowing you to focus on the ML. Petuum runs efficiently at scale on research clusters and cloud compute like Amazon EC2 and Google GCE.

View on GitHub

A Bit More Details

Petuum provides essential distributed programming tools that minimize programmer effort. It has a distributed parameter server (key-value storage), a distributed task scheduler, and out-of-core (disk) storage for extremely large problems. Unlike general-purpose distributed programming platforms, Petuum is designed specifically for ML algorithms. This means that Petuum takes advantage of data correlation, staleness, and other statistical properties to maximize the performance for ML algorithms.

Plug and Play

Petuum comes with a fast and scalable parallel LASSO regression solver, as well as an implementation of topic model (Latent Dirichlet Allocation) and L2-norm Matrix Factorization - with more to be added on a regular basis. Petuum is fully self-contained, making installation a breeze - if you know how to use a Linux package manager and type "make", you're ready to use Petuum. No mucking around trying to find that Hadoop cluster, or (worse still) trying to install Hadoop yourself. Whether you have a single machine or an entire cluster, Petuum just works.

What's Petuum anyway?

Petuum comes from "perpetuum mobile," which is a musical style characterized by a continuous steady stream of notes. Paganini's Moto Perpetuo is an excellent example. It is our goal to build a system that runs efficiently and reliably -- in perpetual motion.
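To make the "parameter server (key-value storage)" and "staleness" ideas from the quoted description a bit more concrete, here is a toy single-process sketch in Python. This is not Petuum's API: the class, its method names and the simple bounded-staleness rule (a worker may run at most a fixed number of iterations ahead of the slowest one) are my own assumptions, used only to illustrate the concept.

```python
class ToyParameterServer:
    """Single-process toy: a key-value store plus one clock per worker."""

    def __init__(self, num_workers, staleness):
        self.values = {}                  # parameter name -> float
        self.clocks = [0] * num_workers   # iterations finished by each worker
        self.staleness = staleness        # allowed lead over the slowest worker

    def get(self, key):
        """Read a parameter (missing keys default to 0.0)."""
        return self.values.get(key, 0.0)

    def inc(self, key, delta):
        """Additive update, so updates from different workers commute."""
        self.values[key] = self.get(key) + delta

    def may_proceed(self, worker):
        """True if the worker is at most `staleness` iterations ahead."""
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        """Mark one iteration of this worker as finished."""
        self.clocks[worker] += 1


server = ToyParameterServer(num_workers=2, staleness=1)

# Worker 0 is fast: it keeps iterating until the staleness bound stops it.
fast_iterations = 0
while server.may_proceed(0) and fast_iterations < 5:
    server.inc("w", 1.0)
    server.tick(0)
    fast_iterations += 1
print("worker 0 ran", fast_iterations, "iterations before waiting")

# Worker 1 is slow: once it finishes an iteration, worker 0 is unblocked.
server.inc("w", 0.5)
server.tick(1)
print("worker 0 may proceed:", server.may_proceed(0))
print("w =", server.get("w"))
```

The point of tolerating some staleness is that fast workers do not have to wait at a barrier after every iteration, while the bound keeps them from running arbitrarily far ahead on outdated parameters.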

Saturday, January 11, 2014

MLConf NY event - April 11, 2014

More fresh news: the next MLConf event will be held in NY on April 11, 2014, with quite an impressive lineup of speakers. A special registration discount for this blog's readers: use the promotion code Dannysblog

Presentations from:

Corinna Cortes, Head of Research, Google

Josh Wills, Director of Data Science, Cloudera

Claudia Perlich, Chief Scientist, Dstillery

Pek Lum, Chief Data Scientist, Ayasdi

Samantha Kleinberg, Computer Science department, Stevens Institute of Technology.

Irene Lang, Math Hacker, OxData

Anqi Fu, Data Scientist, OxData

Shan Shan Huang, VP Product Management, LogicBlox, Inc.

Animashree Anandkumar, Electrical Engineering and Computer Science Dept, UC Irvine.

Thursday, January 9, 2014

Sense: collaborative data science in the cloud

I got this from my collaborator Joey Gonzalez:

I am very proud of myself - the video was published today - so my readers really get the freshest news :-)

If you would like to learn more about the Sense Platform, you should attend our 3rd GraphLab Conference.

Wednesday, January 8, 2014

Linkurious - a new graph visualization tool

I got this from Sebastien Heymann, one of the creators of Gephi, the popular open source graph visualization package. Linkurious is a new startup creating graph visualization tools. Here is a lecture from a couple of months ago at a Neo4j event:

How to Search, Explore and Visualize Neo4j with Linkurious - Jean Villedieu @ GraphConnect SF 2013 from Neo Technology on Vimeo.

ScaleGraph: a new graph processing system

I got this from Prof. Toyotaro Suzumura from the Tokyo Institute of Technology. ScaleGraph is a graph processing system written in the X10 distributed programming language. ScaleGraph is open source software released under the Eclipse license.

From their website: "Recently large-scale graphs with billions of vertices and edges have emerged in a variety of domains and disciplines especially in the forms of social networks, web link graphs, internet topology graphs, etc. Mining these graphs to discover hidden knowledge requires particular middleware and software libraries that can harness the full potential of large-scale computing infrastructures such as super computers. ScaleGraph is a graph library based on the highly productive X10 programming language. The goal of ScaleGraph is to provide large-scale graph analysis algorithms and efficient distributed computing framework for graph analysts and for algorithm developers, respectively."

I have invited ScaleGraph to present at our 3rd GraphLab conference.

Registration for the 3rd GraphLab Conference is open!

Join us for a full day of the latest and greatest applied machine learning and big data analytics!
Monday, July 21, 2014 at the Nikko Hotel SF. Confirmed speakers (very preliminary list): GraphLab, Google, Trifacta, Datapad, Databricks (Spark), Pandora Internet Radio, Cloudera. Confirmed demos: QuantiFind, bigML, Skytree, YarcData, Saffron Technology, Franz.

Additional information
Registration

Special discount code for my blog readers: twentyoff

Additional content to be announced soon!

Tuesday, January 7, 2014

GraphLab Tutorial @ Strata

Only 2 days left for the early bird discount on Strata Santa Clara tutorials registration (deadline: January 9). We are going to have a 3 hour long machine learning with GraphLab tutorial, given by GraphLab CEO, Prof. Carlos Guestrin.

More good news: I have just received a discount code for my blog readers from Ben Lorica, our main contact at O'Reilly Media. The discount code is DATO20, for a 20% discount.

Monday, January 6, 2014

Plotly - a nice visualization package in Python

Got this from my colleague Chris DuBois: a Python Plotly tutorial with some examples. The graphs are interactive: you can zoom in/out, select with the mouse, etc.

Here is one example plot (just a screen capture):

Saturday, January 4, 2014

Reverse engineering Netflix customized categories

I got this blog post from my colleague Eric Wolfe: "How Netflix Reverse Engineered Hollywood". A very comprehensive (and slightly tedious) study of how customized genres are created by Netflix.

A startup called Quantifind has taken this construction a step further, and now predicts the future success of a movie based on the customized genre. I hear they are working with some of the largest studios.

Friday, January 3, 2014

Do you like robots?

Got this nice illustration from Elly Brown, from http://bestcomputersciencedegrees.com

Source: BestComputerScienceDegrees.com

Thursday, January 2, 2014

SIMR - Spark on top of Hadoop

Just learned from my collaborator Aapo Kyrola that the Spark team has now released a plugin which allows running Spark on top of Hadoop, without installing anything and without administrator privileges. This will probably encourage many more companies to try out Spark, which significantly improves on Hadoop performance.

Another related link, from my colleague Chris DuBois, shows that Cloudera is backing Spark.