Monday, April 28, 2014

Help Aapo finalize his PhD and fill out a short survey about GraphChi

Anyone out there who is using GraphChi is highly encouraged to fill out a 5-minute online survey.
My colleague Aapo Kyrola, the creator of GraphChi, would like to wrap up his PhD and collect some valuable user feedback. If GraphChi helped you, please help Aapo in return!

Thursday, April 24, 2014

Facebook buys GraphChi author's Finland-based fitness tracking company

Breaking news from today - my ever-surprising Finnish collaborator Aapo Kyrola, the author of GraphChi, is also a founder of a Finnish fitness-monitoring company. Facebook announced today that it is buying the company. Aapo will join FB. Well done, Aapo!

MMDS 2014 registration is now open

Registration for the 2014 Workshop on Algorithms for Modern
Massive Data Sets (MMDS 2014) is now available at:

In addition to the talks, there will be a poster session one evening.
You may apply to present a poster at the link above.

Event: MMDS 2014: Workshop on Algorithms for Modern Massive Data Sets
Dates: June 17-20, 2014
Location: UC Berkeley, Berkeley, CA

Synopsis: The 2014 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2014) 
will address algorithmic, mathematical, and statistical challenges in modern 
statistical data analysis. The goals of MMDS 2014 are to explore novel techniques 
for modeling and analyzing massive, high-dimensional, and nonlinearly-structured 
scientific and internet data sets, and to bring together computer scientists, 
statisticians, mathematicians, and data analysis practitioners to promote cross-
fertilization of ideas.

Organizers: Michael Mahoney (UC Berkeley), Alex Shkolnik (Stanford),
Petros Drineas (RPI), Reza Zadeh (Stanford), Fernando Perez (UC Berkeley)

Tuesday, April 22, 2014

MesaGraph acquired by Twitter

I just heard from our French connection, Igor Carron, that MesaGraph, a French startup working on graph analytics, was just acquired by Twitter. It seems they were working on sentiment analysis. Always great to hear there is a growing demand for graph-based analytics. :-)
Here is the TechCrunch report.

Wednesday, April 16, 2014

New trends in sharing data science work

I got the following VentureBeat article from my colleague Carlos Guestrin.

It seems there is an interesting trend of allowing data scientists to share their work: Imagine if a company’s three highly valued data scientists can happily work together without duplicating each other’s efforts and can easily call up the ingredients and results of each other’s previous work. 

That day has come. As the data scientist arms race continues, data scientists might want to join forces. Crazy idea, right? Two San Francisco startups — Domino Data Lab and Sense — have emerged recently with software to let data scientists collaborate on multiple projects. In a way, it’s like code storehouse GitHub for the data science world. A Montreal startup named has been talking about the same themes, but it brings a more social twist. Another startup, Mode Analytics, is building software for data analysts to ask questions of data without duplicating previous efforts. And at least one more mature software vendor, Alpine Data Labs, has been adding features to help many colleagues in a company apply algorithms to code on one central hub. 

If you are interested in learning about trends in sharing data science work, you should attend our annual GraphLab conference - we have demos from Domino Data Lab, Sense, & Alpine Data Labs!

An update (April 17) - I have now connected with Derek Steer, founder of Mode Analytics, and they will also give a demo at our GraphLab event.

Tuesday, April 15, 2014

Big data analytics front is heating up!

A follow-up to my previous blog post about Mahout vs. Oryx: recent news is that Intel has made a significant investment in Cloudera, and the rumor is that Intel is going to abandon its own Hadoop release.
On the other hand, Mahout is switching to work on top of Spark. Mahout is backed by MapR, which is backed by EMC.

GraphChi-DB - new experimental graph database released!

I got the following from my collaborator Aapo Kyrola:

I have just released the source code of GraphChi-DB to GitHub!

The repository is here:

GraphChi-DB is a research project that enhances GraphChi with database functionality:
- Fast queries
- Data columns for edges and vertices
- Fast insertions of new edges and vertices.

Compared to existing single-computer graph databases, it scales to much bigger graphs and - unlike other graph databases - provides the familiar GraphChi programming model (it also provides a rudimentary edge-centric programming model more similar to GraphLab). You can read the publication (below) for a performance evaluation.

It is written in Scala (with some Java). Scala is a great language for a database because it has an interactive console (REPL), so you can query and interact with the database directly. GraphChi-DB does not support any query language; instead, it is accessed via the Scala API.

Note: the code is experimental, probably very buggy and has an awful API. Do not use it for anything important! Do not run your Bitcoin exchange with it!
GraphChi-DB also requires some expertise in Scala to be really usable. I would recommend it only to researchers and students at this point. For commercial-level graph databases, look at Neo4j or Titan. To get started, look at the example applications (explained in the readme of the project).

The design and evaluation of GraphChi-DB can be found in the preprint:

Saturday, April 12, 2014

Interesting Graph Applications in Retail

I learned from Amit Steinberg about two additional interesting applications for graph analysis in retail.

Attribution modeling for online marketing
Interesting multichannel attribution work by UPenn: Analyzing the Customer Journey: Attribution Modeling for Online Marketing Exposures in a Multi-Channel Setting

In a nutshell, they build a Markov chain that models the user's exposure to different marketing channels, and model how the different components contribute to conversion.
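To make the Markov chain idea a bit more concrete, here is a minimal Python sketch. It is not the model from the paper. It assumes a plain first-order chain and scores each channel by its "removal effect" (how much the overall conversion probability drops when that channel is cut out), which is one common way to do this kind of attribution. The journeys, channel names and function names are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical state names for the beginning and the two ends of a journey.
START, CONV, NULL = "(start)", "(conversion)", "(null)"


def transition_probs(journeys):
    """Estimate first-order transition probabilities from observed journeys.

    Each journey is a pair (list_of_channels, converted_flag).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for channels, converted in journeys:
        states = [START] + list(channels) + [CONV if converted else NULL]
        for src, dst in zip(states, states[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


def conversion_prob(probs, removed=None, iterations=1000):
    """Probability of reaching CONV from START.

    A removed channel behaves like a dead end: any transition into it
    contributes nothing to the conversion probability.
    """
    value = defaultdict(float)
    value[CONV] = 1.0
    # Simple fixed-point iteration; fine for small toy chains.
    for _ in range(iterations):
        for src, dsts in probs.items():
            if src == removed:
                continue
            value[src] = sum(
                p * value[dst] for dst, p in dsts.items() if dst != removed
            )
    return value[START]


def removal_effects(journeys):
    """Relative drop in conversion probability when each channel is removed."""
    probs = transition_probs(journeys)
    base = conversion_prob(probs)
    channels = [s for s in probs if s != START]
    if base == 0.0:
        return {c: 0.0 for c in channels}
    return {c: 1.0 - conversion_prob(probs, removed=c) / base for c in channels}


# Toy data (made up): four customer journeys, two of which converted.
journeys = [
    (["search", "email"], True),
    (["search"], False),
    (["display", "search"], True),
    (["display"], False),
]

for channel, effect in sorted(removal_effects(journeys).items()):
    print(f"{channel}: removal effect {effect:.2f}")
```

A channel with a larger removal effect sits on more of the paths that end in a conversion. In practice the effects are usually normalized so that they sum to one before conversions are credited to channels.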

Botnet detection
The second problem many retailers are facing is how to filter out botnet behavior from real user behavior. One interesting paper in this domain which uses graphs is: BotGraph: Large Scale Spamming Botnet Detection. The second paper is: An Analysis of Social Network-Based Sybil Defenses.