Friday, September 28, 2012

Bizarre Job Offers

Every now and then I get strange emails with all kinds of job offers.
Yesterday I got an email titled "URGENT: Invitation from Portuguese Foundation for Science and Technology". I was flattered to learn from Vanessa that the Portuguese Foundation for Science and Technology would like me to review a project proposal in the area of P2P networks.

So far nothing bizarre here. I continued reading and got to the compensation part. I was glad to learn that "Furthermore, FCT pays € 70 per proposal evaluated, from which a tax deduction will be applied as required by Portuguese law." There was also an attached link to a 22-page tutorial on how to rate proposals.

The email was sent on Thursday night, and I need to answer by Sunday whether I am willing to rate the proposal. I am almost certain Vanessa does not work on weekends.

Are any of my blog readers Portuguese? Let's be conservative and assume there is a 20% tax. We are left with 56 Euro, which is about $72. Further assume I need one hour to read the proposal rating guidelines and three hours to read and rate the proposal. So we are talking about an $18-an-hour gig!

Now, let's be more realistic. Depositing a check in Euro and converting the money will probably cost a $20 fee, which leaves a $52 gig. Reporting foreign income at the end of your tax report will add at least $50 of accountant overhead. So we are left with $2 for 4 hours of work.
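For the skeptical, here is the arithmetic above in one place (a toy sketch; every number is an assumption from this post, not an official figure):

```python
# Back-of-the-envelope value of the reviewing gig. All inputs are the
# assumptions made in this post (20% tax, $20 deposit fee, $50 accountant
# overhead, 4 hours of work), not official figures.
def net_pay(gross_eur=70.0, tax_rate=0.20, eur_to_usd=72.0 / 56.0,
            deposit_fee_usd=20.0, accountant_fee_usd=50.0, hours=4):
    after_tax_eur = gross_eur * (1 - tax_rate)      # 56 EUR
    after_tax_usd = after_tax_eur * eur_to_usd      # ~72 USD
    net = after_tax_usd - deposit_fee_usd - accountant_fee_usd
    return net, net / hours

net, hourly = net_pay()
print(net, hourly)
```

If the real tax rate turns out to be lower than 20%, just change tax_rate and reconsider.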

And now I ask Vanessa: are you serious? Why don't you simply ask me to volunteer to review your proposal? It would be much cheaper for me... :-)

I asked Vanessa to find out what tax is applied. It may be that the income tax is lower than 20%; in that case all my calculations are wrong, and then I would need to reconsider...

If any of my readers has advice on what to do, please comment ASAP. We only have until Sunday to deploy the wisdom of crowdsourcing.

7th Annual Machine Learning Symposium

I got this email from Danielle:
On October 19, 2012, the New York Academy of Sciences will present the conference, "7th Annual Machine Learning Symposium" at the Academy headquarters in New York City.  This symposium features keynote speakers in applied and theoretical machine learning as well as "spotlight" talks selected from poster abstract submissions.

I write to enquire if you could include a conference announcement about the event on your blog as your professional community would be the ideal audience for this event.

Thank you in advance for your time and consideration.  I look forward to hearing from you in the near future.

Best regards,
Danielle M. Haynes-Amran
Marketing Coordinator: Programs
I looked over their website and indeed it seems they have a serious program. Their keynote speakers are Peter Bartlett, Vladimir Vapnik and William Freeman.

Friday, September 21, 2012

Spotlight: M6D

I met Troy Raeder at ACM KDD CUP in Beijing and asked him to tell me a little about M6D:

From a business perspective, we do targeted display advertising.  Essentially, consumer brands contract with us to find strong targets for their advertising and show display ads (banner or video) to these targets on the web and mobile devices.  Within display advertising, there are two common approaches: "retargeting", which is advertising to people who have already visited your brand's website; and "prospecting", which requires finding people who have never been to your site but are good candidates for your brand.  Retargeting is fairly popular.  You've probably noticed it, where you visit a commerce site and then ads for that site follow you elsewhere on the web.  Prospecting, by contrast, can have impressive scale (obviously there are more non-visitors than visitors to your website), and that is the main focus of our business.  More specifically, we use the retargeting population (site visitors) as a seed set from which to build a classification model for predicting brand affinity in the population of potential prospects.
From a machine learning perspective, we solve a huge sparse classification problem, where features are URLs that people have visited and the class (outcome) is some sort of brand action.  Usually we use visits to the brand site as a proxy for brand affinity because it is more common than purchase conversions but less random than clicks.  The coolest thing about our system, at least to me, is that it is automatic and dynamic on a number of levels.  We re-score browsers regularly as we get new data on them, so the set of good prospects for a particular brand is accurate pretty much in real time, and everything we do -- rescoring browsers, retraining models, and adding new brands -- happens very automatically.
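To make the setup concrete, here is a toy sketch of this kind of model (hypothetical Python, not m6d's actual system): each browser is a bag of visited URLs, the label is a proxy brand action, and a logistic model is trained by plain SGD over the sparse features.

```python
import math

# Toy sketch of a sparse URL-feature classifier (illustrative only, not m6d's
# actual system). Each browser is a set of visited URLs; the label is a proxy
# brand action such as a visit to the brand site.

def train_logistic(data, epochs=50, lr=0.5):
    """data: list of (urls, label). Returns (weights dict, bias)."""
    w, bias = {}, 0.0
    for _ in range(epochs):
        for urls, y in data:
            z = bias + sum(w.get(u, 0.0) for u in urls)
            p = 1.0 / (1.0 + math.exp(-z))      # predicted brand affinity
            g = y - p                           # log-likelihood gradient
            bias += lr * g
            for u in urls:
                w[u] = w.get(u, 0.0) + lr * g
    return w, bias

def score(w, bias, urls):
    z = bias + sum(w.get(u, 0.0) for u in urls)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: sports-site visitors act on the (fictional) brand.
data = [({"espn.com", "nba.com"}, 1),
        ({"espn.com", "golf.com"}, 1),
        ({"recipes.com", "gardening.com"}, 0),
        ({"recipes.com", "knitting.com"}, 0)]
w, b = train_logistic(data)
print(score(w, b, {"espn.com"}), score(w, b, {"recipes.com"}))
```

Re-scoring a browser as new URL visits arrive is then just another call to score with the updated set, which mirrors the dynamic rescoring described above.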

We have a number of papers that describe our algorithms and systems in greater detail, including two that will be presented at KDD in China this year.  If you read these, you'll know about as much as you’d care to know about our system.

This is the original paper describing our methods:

Our measurement methodology:

Our large-scale classification system:
Design Principles of Massive, Robust Prediction Systems

And our bid modeling strategy
Bid Optimizing and Inventory Scoring in Targeted Online Advertising

While many companies are very hush-hush about their applied ML methods, I find it very useful that m6d actually describes their ML approaches. It helps us stay up to date on what is happening in the industry.

Spotlight: ModCloth

Here is a quick Q/A session with Marcos Sainz, a fellow CMUer who is a data scientist at ModCloth. ModCloth has an office in Pittsburgh and was one of the companies that attended our July workshop. I made the unfortunate mistake of showing my wife their website, an error which immediately cost me $200... :-)

What is your special market / what makes ModCloth unique?
ModCloth’s mission is to democratize the fashion industry. We seek to do so by empowering our community of shoppers through a social commerce platform that brings products to market with customer feedback and validation. Programs like Be the Buyer™ allow customers to vote items from emerging designers into production. ModCloth has built a loyal community through engaging, interactive contests on our blog and an active involvement in social networks such as Twitter and Facebook, and we’re gaining even more attention for our use of new social platforms, such as Instagram and Pinterest.
What is the size of the company? where are you based?
ModCloth was founded in 2002 when Eric & Susan Koger were only 18 and 17 respectively. In just several years, we've grown from our humble beginnings in a Carnegie Mellon University dorm room to “America’s Fastest-Growing Retailer,” with offices in Pittsburgh, San Francisco, and Los Angeles. In 4 years, ModCloth has grown from 3 to over 300 employees (and growing)! 
ModCloth has more than 500,000 fans on Facebook, over 80,000 followers on Twitter, works with over 700 independent designers, and averages over 500 product reviews daily. Our Be the Buyer™ program has received over 11 million votes so far.
What are some interesting machine learning challenges you face?
A particularly noteworthy challenge we face is the "cold start" problem as it relates to our lean and ever-changing supply chain. In plain words, by the time we have gathered enough data about any given product that we sell on our site, it is too late to use that data for inference and prediction for that product. To make things more exciting, mix in the fact that fashion is a relatively volatile and difficult-to-quantify concept. Change is the only constant.
Would you like to share anything about your solution methods / mode of work ?
Central to the success of some of our Data Science initiatives has been the creation of a Data Warehouse or centralized repository of data originating from multiple systems. Having all or most relevant data in one place in an easily-consumable format makes the life of data analysts/scientists much easier.

Friday, September 14, 2012

The joy of feedback or interesting ML projects out there!

One of my greatest pleasures is keeping in touch with our user community and finding out what interesting projects are out there. Here are some small examples from just the last week. I hope this will encourage more people to contact me!

Phillip Trotter, ptav: 
This is just a short note to say huge thanks for the GraphChi posts on your blog.

I have a small startup developing complex systems simulation and visualization tools, and we are in the early stages of using GraphChi for both feature identification in network graphs and for data validation following simulations. We potentially have a lot of other uses for GraphChi, and I am really enjoying getting to know the code base. A big part of that has been your articles, and I just wanted to say thank you for the posts - they have been incredibly helpful.

Ben Mabey, Red brain labs
Thanks for all your work on graphlab.  I think it is one of the most interesting ML projects in a long time.  Between graphlab's CF library and libfm I haven't been this excited about recsys since the FunkSVD. :) At my work we're excited about the performance of graphlab since it allows for faster experimentation and cheaper builds.

As a side hobby project I'm trying to develop a recsys for GitHub repositories. The majority of the user feedback is implicit (e.g. commits to repos, opened issues, comments, etc.), with the only explicit feedback (starring a repo) being binary. Since GraphLab's CF toolkit already implements some of the papers I had read dealing with implicit feedback, it seemed like a great way to get started.
thanks for the good work on the GraphLab project. --Norm

Paul Lefkopoulos,
Thanks for the work. We are a fresh French startup specializing in web analytics and digital media optimization. One of our main aims is to increase the website conversion rate of our retailing customers. We currently use your tool to build a recommender system for a small retailer in order to increase its cross-sell.

 I really appreciate such a responsive and helpful forum!  Thank you, thank you.

Shingo Takamatsu, Sony Japan
I have read your papers on GraphLab (and GraphChi) and implemented a few algorithms on GraphLab (Danny’s blog was great help!).

Xiaobing Liu, Tencent China

I am a senior engineer at Tencent, one of the largest internet companies in China. I am interested in distributed/parallel large-scale machine learning algorithms and their applications in search advertising targeting and search engine relevance. So far, I have implemented parallel logistic regression in both batch and online learning modes, which is applied to search ad CTR prediction, and I have also implemented some topic models, such as LDA and PLSA; all of these algorithms are based on MPI. I am quite excited about parallel frameworks such as Pregel and GraphLab. Currently, I am implementing PLSA and logistic regression on GraphLab.

Zygmunt Zając
I've been using GraphLab mainly for data sport at Kaggle; currently I'm in the top 100 among 55 thousand registered users. I also write about Kaggle competitions and machine learning in general on my blog. I like trying new machine learning software, and I found GraphLab to be fast, relatively easy to use, and comprehensive (a lot of options and algorithms, built-in validation, etc.).

Ayman M Shalaby, University of Waterloo, Canada
I've heard about the GraphLab project and it sounds very promising.
I'm particularly interested in using Graphlab to implement large scale Dynamic Bayesian Networks. 

Steffen Rendle, Author of the libFM collaborative filtering package
I am impressed by the size of the GraphLab project and the effort you make to scale the algorithms to multiple cores/machines. I think this will become a very important topic as the parallelism of computers keeps increasing.

And how can we forget our mega collaborator JustinYan?
I am Qiang (Justin) Yan, a master's student at the Chinese Academy of Sciences. My main focus is large-scale data mining and collaborative filtering. I worked as an intern building recommender systems for Baidu and Hulu over the last couple of years. Recently I have focused on implementing collaborative filtering models on GraphLab and GraphChi.

Thursday, September 13, 2012

GraphChi parsers toolkit

Consecutive IDS parser

If you would like to use the GraphChi collaborative filtering library, you first need to parse your data into consecutive user integer ids and consecutive item integer ids.
For example, assume you have a user movie rating database with the following format:
Johny Walker,The Matrix (Director's Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1
Namely, each row contains one rating with the format
<user name>,<item name>,<rating>

And you like to convert it to sparse matrix market format:
1 1  5
2 2  3
3 3  1

Namely user 1 rated item 1 with rating of 5, etc.

Additionally, you can also convert files of the format:
12032 12304-0323-X 3
Where for example 12032 is the first user ID, 12304-0323-X is the ISBN of the rated book and 3 is the rating value.

You can use the consecutive_matrix_market parser to create the appropriate format from your data files. 

The input to the parser is either a CSV file, a TSV file, or a matrix market input file with non-consecutive integer ids. The users and items can be either strings or integers.

There are several outputs to the consecutive IDS parser:
1) filename.out - a rating file with the format [user] [movie] [rating], where both the user ids and the item ids are consecutive integers. In our example:

1 1  5
2 2  3
3 3  1

2) Mapping between user names/ids to consecutive integers. In our example:

Abba Babba 2
Bondy Jr. 3
Johny Walker 1
3) Mapping between movie names/ids to consecutive integers. In our example:

James Bond VII 2
The Matrix (Director's Cut) 1
Titanic 3

4+5) The same maps in reverse, namely mapping the unsigned integers back to the string ids.
6) Matrix market header file - in our examples there are 3 users, 3 movies and 3 ratings:

%%MatrixMarket matrix coordinate integer general
3 3 3

To run the consecutive IDS parser, prepare your input file. For example, the file named "movies" contains:

Johny Walker,The Matrix (Director's Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1

Finally, you run the parser:
./toolkits/parsers/consecutive_matrix_market --training=movies --csv=1

The supported command line flags are:
--csv=1 - csv format
--tsv=1 - tsv format
otherwise - space-separated format

--file_list=XXX - input file which contains list of files to  parse (each file in a new line) [optional]

--single_domain - if set to 1, both from and to ids are drawn from the same id range; if set to 0, from ids and to ids get separate id ranges, which is useful for bipartite graphs (default is 0).

--binary - if set to 1, edges have no weight. If set to 0, edge weight is expected. (Default is 0).

--debug=1 - more verbose output (recommended). Default is 0.
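For illustration, the id-mapping step can be sketched in a few lines of Python (a hypothetical re-implementation, not the actual C++ tool; it skips the map and header output files):

```python
import csv

# Minimal sketch of what the consecutive_matrix_market parser does
# (illustrative only): assign consecutive 1-based ids to user and item
# strings in order of first appearance.

def to_consecutive(lines, delimiter=","):
    users, items, out = {}, {}, []
    for row in csv.reader(lines, delimiter=delimiter):
        user, item, rating = row[0], row[1], row[2]
        uid = users.setdefault(user, len(users) + 1)
        iid = items.setdefault(item, len(items) + 1)
        out.append((uid, iid, rating))
    return out, users, items

raw = ["Johny Walker,The Matrix (Director's Cut),5",
       "Abba Babba,James Bond VII,3",
       "Bondy Jr.,Titanic,1"]
edges, users, items = to_consecutive(raw)
print(edges)
```

Running this on the example above yields the same (1, 1, 5), (2, 2, 3), (3, 3, 1) triplets and the same user/item maps shown earlier.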

Adj Parser

Adj format is explained here:

The adjacency list file format stores, on each line, a source vertex followed by a list of all its target vertices; each line has the following format:
[vertex ID]  [number of target vertices] [target ID 1] [target ID 2] [target ID 3] ...
This format is more compact to store, and in practice will partition better in the distributed setting (since edges have a natural grouping). Furthermore, this format is capable of storing disconnected vertices.
For instance, the following graph (with edges 1→2, 1→5, 7→7, 7→1, 5→7) can be stored as:
1 2 2 5
7 2 7 1
5 1 7

The parser converts the Adj file into GraphChi CF compatible format.
For example, assume we have a file with 2 lines:

2884424247 11 1210682095 1789803763 1879013170 1910216645 2014570227 2109318072 2268277425 2289674265 2340794623 2513611825 2770280793
2885009247 16 1420862042 1641392180 1642909335 1775498871 1781379945 1784537661 1846581167 1934183965 2011304655 2016713117 2017390697 2128869911 2132021133 2645747085 2684129850 2866009832
Namely, one node with 11 edges and another node with 16 edges.

Assume the file name is b. We create a file named a whose first line contains "b" (without the quotes!).
./toolkits/parsers/adj --file_list=a --single_domain=1
Starting program: /home/bickson/graphchi/toolkits/parsers/snap --file_list=a --single_domain=1
[Thread debugging using libthread_db enabled]
WARNING:  snap.cpp(main:204): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to 
[file_list] => [a]
[single_domain] => [1]
INFO:     snap.cpp(parse:196): Finished parsing total of 3 lines in file b
total map size: 29
Finished in 0.000591
INFO:     snap.cpp(save_map_to_text_file:74): Wrote a total of 29 map entries to text file:
INFO:     snap.cpp(save_map_to_text_file:86): Wrote a total of 29 map entries to text file:
INFO:     snap.cpp(main:255): Writing matrix market header into file:

Now we look at the output:
bickson@thrust:~/graphchi$ cat b.out 
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
13 14
13 15
13 16
13 17
13 18
13 19
13 20
13 21
13 22
13 23
13 24
13 25
13 26
13 27
13 28
13 29

We also got a matrix market header that can be used in graphchi.
bickson@thrust:~/graphchi$ cat 
%%MatrixMarket matrix coordinate integer general
29 29 27

Additionally, map files between the old ids and the new ids are stored in the directory.
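The conversion itself is simple; here is a hypothetical Python sketch of it (the actual parser is C++ and also writes the map and header files):

```python
# Sketch of the adj -> edge list conversion (illustrative only): each line is
# [source] [edge count] [targets...], and every node id seen is remapped to a
# consecutive integer, sources and targets sharing one range (single_domain=1).

def adj_to_edges(lines):
    ids, edges = {}, []
    def remap(v):
        return ids.setdefault(v, len(ids) + 1)
    for line in lines:
        tokens = line.split()
        src, count, targets = tokens[0], int(tokens[1]), tokens[2:]
        assert len(targets) == count          # sanity-check the edge count
        s = remap(src)
        edges.extend((s, remap(t)) for t in targets)
    return edges, ids

# the small example graph from above
lines = ["1 2 2 5", "7 2 7 1", "5 1 7"]
edges, ids = adj_to_edges(lines)
print(edges)
```

On the large two-line example above, the same remapping logic produces the 1 → 2..12 and 13 → 14..29 edges shown in b.out.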

LDA Parser

At the request of Baoqiang Cao I have started a parsers toolkit in GraphChi, to be used for preparing data for GraphLab/GraphChi. The parsers should be used as templates which can easily be customized to user-specific needs.
LDA parser code is found here.

Example input file:

I was not a long long ago here
because i am not so damn smart.
as you may have thought

The assumption is that each line holds the words of a certain document. We would like to assign the strings unique ids, and count how many times each word appears in each document.

The input to GraphLab's LDA is in the format
[docid] [wordid] [word count]

Example run:
1) Create a file "stamfile" with the example input above.
2) Create a file "a" which has the file name "stamfile" in its first line.
3) Run the parser:
bickson@thrust:~/graphchi$ ./toolkits/parsers/texttokens --file_list=a --min_threshold=1 --max_threshold=10
WARNING:  texttokens.cpp(main:146): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to 
[file_list] => [a]
INFO:     texttokens.cpp(parse:138): Finished parsing total of 4 lines in file stamfile
total map size: 16
Finished in 0.024819
INFO:     texttokens.cpp(save_map_to_text_file:56): Wrote a total of 16 map entries to text file: amap.text
INFO:     texttokens.cpp(save_map_to_text_file:68): Wrote a total of 16 map entries to text file:

Explanation: min_threshold filters the output to words that appear at least min_threshold times in total, and max_threshold to words that appear at most max_threshold times.

The output of the parser is a file named stamfile.out
bickson@thrust:~/graphchi$ cat stamfile.out 
2 1 1
2 2 1
2 3 1
2 4 2
2 5 1
2 6 1
3 11 1
3 3 1
3 7 1
3 8 1
3 9 1
3 10 1
4 12 1
4 13 1
4 14 1
4 15 1
4 16 1

And the mapping files:
bickson@thrust:~/graphchi$ cat amap.text
may 14
long 4
am 8
because 7
you 13
as 12
so 9
here 6
have 15
not 3
i 1
smart 11
thought 16
damn 10
was 2
ago 5

bickson@thrust:~/graphchi$ cat 
1 i
2 was
3 not
4 long
5 ago
6 here
7 because
8 am
9 so
10 damn
11 smart
12 as
13 you
14 may
15 have
16 thought
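For readers who want to adapt the template, the core of texttokens can be sketched in Python (a hypothetical re-implementation; the actual tool is C++ and applies its own token filtering, so the exact ids may differ from the output above):

```python
# Hypothetical sketch of the texttokens parser: words get consecutive ids,
# and output rows are (docid, wordid, count), filtered by total word count.

def tokenize(lines, min_threshold=1, max_threshold=10):
    ids, docs, totals = {}, [], {}
    for docid, line in enumerate(lines, start=1):
        counts = {}
        for word in line.lower().replace(".", " ").split():
            wid = ids.setdefault(word, len(ids) + 1)
            counts[wid] = counts.get(wid, 0) + 1
            totals[wid] = totals.get(wid, 0) + 1
        docs.append((docid, counts))
    # keep only words whose total count falls inside the thresholds
    return [(docid, wid, c)
            for docid, counts in docs
            for wid, c in sorted(counts.items())
            if min_threshold <= totals[wid] <= max_threshold], ids

text = ["I was not a long long ago here",
        "because i am not so damn smart.",
        "as you may have thought"]
out, ids = tokenize(text)
print(out[:3])
```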

Twitter parser

A second parser, which is slightly fancier, goes over user Twitter tweets in the following format:

 * Twitter input format is:
 * T  2009-06-11 16:56:42
 * U
 * W  Bus is pulling out now. We gotta be in LA by 8 to check into the Paragon.
 * T  2009-06-11 16:56:42
 * U
 * W  灰を灰皿に落とそうとすると高確率でヘッドセットの線を根性焼きする形になるんだが
 * T  2009-06-11 16:56:43
 * U
 * W  @carolinesweatt There are no orphans...of God! :) Miss your face!
And extracts graph links between the retweets.

Aggregate parser

This parser computes the aggregate of the requested column.
The input is a sorted graph:
1 2 4 
1 2 4 
2 3 3 
3 4 2 
3 4 2 

And the output has the requested column (column 3 in this example) aggregated over consecutive identical edges:

1 2 8
2 3 3
3 4 4
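A hypothetical Python sketch of this aggregation step (the actual parser is C++): since the input file is sorted, it suffices to merge consecutive identical (from, to) pairs, summing the requested column.

```python
# Illustrative sketch of the aggregator: consecutive identical (from, to)
# edges in a sorted file are merged, and the requested column is summed.

def aggregate(lines, col=2):
    out = []
    for line in lines:
        fields = line.split()
        key, value = fields[:col], float(fields[col])
        if out and out[-1][0] == key:
            out[-1][1] += value          # same edge as previous line: sum
        else:
            out.append([key, value])
    return [" ".join(k) + " " + str(int(v)) for k, v in out]

rows = ["1 2 4", "1 2 4", "2 3 3", "3 4 2", "3 4 2"]
print(aggregate(rows))
```

Note that this only works on sorted input, which is why the cascading example below sorts the edge list first.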

Operating the parsers

1) Install graphchi as explained in steps 1+2 here.
2) Compile with "make parsers"
3) Run using ./toolkits/parsers/texttokens --file_list=files

Cascading several parsers together

I got this from Stefan Weigert:
I would like to use GraphLab for a top-k algorithm. There is a communication log which I want to process as an input file. Assume the log contains the following entries:

YVjAeZQjnVA IfrTTVlatui 050803 156
YVjAeZQjnVA IfrTTVlatui 050803 12
GNgrmichxmG GNgriWokEhN 050803 143
YnRdCKZkLao MHexzaXWCPL 050803 0
RGNReqpKcZw RGNRSTDdqew 050803 0
LPHSeuGhYkN ZFwbovKzAxY 050803 1
sijmyRRfkwl XtqJaHYFEPqbZqNGPCr 050803 68
RGNReqpKcZw RGNRSTDdqew 050805 6
RGNReqpKcZw RGNRSTDdqew 050805 6
sijmyRRfkwl XtqJaHYFEPqbZqNGPCr 050805 12

Where each line has the following format:
[caller id] [receiver id] [date] [call duration]

What I wanted to do is:
1) load the first line and add an edge YVjAeZQjnVA -> IfrTTVlatui with 156 as its attribute
2) load the second line, see that the edge YVjAeZQjnVA -> IfrTTVlatui already exists, and just update the edge attribute to 168 (168 = 156 + 12)

Proposed solution:

1) Run consecutive ids parser to convert the strings to consecutive ids.
    a) Assume the input file is named "test". Prepare a list file named "files" with the string "test" in its first line.
   b) run consecutive parser with:

./toolkits/parsers/consecutive_matrix_market --file_list=files
WARNING:  consecutive_matrix_market.cpp(main:177): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to 
[file_list] => [a]
INFO:     consecutive_matrix_market.cpp(parse:169): Finished parsing total of 10 lines in file test
total map size: 6
Finished in 0.000117
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:69): Wrote a total of 6 map entries to text file:
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:81): Wrote a total of 6 map entries to text file:
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:69): Wrote a total of 6 map entries to text file:
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:81): Wrote a total of 6 map entries to text file:
INFO:     consecutive_matrix_market.cpp(main:225): Writing matrix market header into file:
   c) The output is saved into the file: test.out
1 1 050803 156
1 1 050803 12
2 2 050803 143
3 3 050803 0
4 4 050803 0
5 5 050803 1
6 6 050803 68
4 4 050805 6
4 4 050805 6
6 6 050805 12

2) Sort the output file:
sort -k 1,1 -k 2,2 -g test.out > test.sorted
The sorted file is:
1 1 050803 156
1 1 050803 12
2 2 050803 143
3 3 050803 0
4 4 050803 0

3) Run the aggregator parser:
   a) edit the file "files" to contain the string test.sorted in the first line.
   b) Run the aggregator parser:
   bickson@thrust:~/graphchi$ ./toolkits/parsers/aggregator --file_list=a --col=4
WARNING:  aggregator.cpp(main:151): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to 
[file_list] => [a]
[col] => [4]
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 6.4e-05 edges: 1
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 8.6e-05 edges: 2
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 9.6e-05 edges: 3
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000103 edges: 4
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000109 edges: 5
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000114 edges: 6
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000119 edges: 7
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000124 edges: 8
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000129 edges: 9
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000134 edges: 10
INFO:     aggregator.cpp(parse:143): Finished parsing total of 11 lines in file test.sorted
total map size: 0
Finished in 0.000559
c) Now the output is:
1 1 168
2 2 143
3 3 0
4 4 12
5 5 1
6 6 80

We are done!

ips2ids parser

This parser converts graph in the string ip address format (A.B.C.D) into consecutive id format.
Count parser

This parser takes a text file with an unsigned int on each line and counts how many instances of each unsigned int there are.

For example, an input file named "in":


The output of "./count in" is:
1 1
2 2
3 8
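The counting step is essentially a histogram; here is a hypothetical Python equivalent (the input list below is made up to reproduce the output above):

```python
from collections import Counter

# Illustrative sketch of ./count: histogram the values in the input.
values = [1, 2, 2] + [3] * 8     # hypothetical input matching the output above
for value, n in sorted(Counter(values).items()):
    print(value, n)
```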

Acknowledgements/ hall of fame

Tuesday, September 11, 2012

Interesting paper: energy models for graph clustering

Today I got a question from Timmy Wilson, our man in Ohio, about the paper Energy Models for Graph Clustering by Andreas Noack. The paper has a nice treatment of power-law graph visualization (like social networks). In traditional layouts, the popular nodes which have many links get larger importance and are thus visualized in the center, so we get a messy layout:
The paper proposes to use an edge repulsion model instead of a node repulsion model, so that highly connected nodes are no longer stacked in the middle:
This example is taken from the above paper and visualizes links between terms in an online dictionary.

The layout cost function is:

    min sum_over_edges_u,v || p(u) - p(v) || - sum_over_all_pairs_u,v deg(u) deg(v)  log || p(u) - p(v) ||

Where ||x|| is the Euclidean distance, p(u) is the layout position of node u, and deg(u) is the degree of node u.
The cost function is composed of two terms. The first pulls nodes that are close in the graph to be close together in the layout. The second pushes every pair of nodes as far apart as possible, weighted by the product of their degrees; two high-degree nodes placed close together incur a large cost, so they end up farther apart.

Regarding the solution, it is rather straightforward to implement the method using gradient descent. The first term of the cost function fits very nicely into GraphLab/GraphChi, since it is local over the graph edges. However, the second term computes distances over all pairs of nodes; this is not a local operation and does not fit well within GraphLab.
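To make the gradient concrete, here is a naive Python sketch of descending this cost function (illustrative only; note the O(n^2) inner loop over all pairs, which is exactly the non-local part):

```python
import math, random

# Naive gradient descent on the edge-repulsion energy above:
#   sum_edges ||p(u)-p(v)|| - sum_pairs deg(u) deg(v) log ||p(u)-p(v)||
# Illustrative sketch only, on a tiny toy graph.

def layout(edges, n, steps=200, lr=0.01):
    random.seed(0)                       # deterministic toy run
    pos = [[random.random(), random.random()] for _ in range(n)]
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1; deg[v] += 1
    for _ in range(steps):
        grad = [[0.0, 0.0] for _ in range(n)]
        for u, v in edges:               # attraction term: ||p(u) - p(v)||
            dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
            d = math.hypot(dx, dy) + 1e-9
            grad[u][0] += dx / d; grad[u][1] += dy / d
            grad[v][0] -= dx / d; grad[v][1] -= dy / d
        for u in range(n):               # repulsion: -deg(u)deg(v) log ||.||
            for v in range(u + 1, n):    # all pairs -- the non-local part
                dx, dy = pos[u][0] - pos[v][0], pos[u][1] - pos[v][1]
                d2 = dx * dx + dy * dy + 1e-9
                w = deg[u] * deg[v]
                grad[u][0] -= w * dx / d2; grad[u][1] -= w * dy / d2
                grad[v][0] += w * dx / d2; grad[v][1] += w * dy / d2
        pos = [[p[0] - lr * g[0], p[1] - lr * g[1]] for p, g in zip(pos, grad)]
    return pos

# two triangles joined by a bridge edge
pos = layout([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)], n=6)
print(pos[0])
```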

Wednesday, September 5, 2012

NIPS big learning workshop

My collaborators Joey and Yucheng are organizing this year's NIPS Big Learning workshop. They also mentioned that some skiing may be involved... :-)

Big Learning 2012: Algorithms, Systems and Tools

NIPS 2012 Workshop

Sameer Singh <> (UMass Amherst)
John Duchi <> (UC Berkeley)
Yucheng Low <> (Carnegie Mellon University)
Joseph Gonzalez <> (UC Berkeley)

Submissions are solicited for a one day workshop on December 7-8 in Lake Tahoe, Nevada.

This workshop will address algorithms, systems, and real-world problem domains related to large-scale machine learning (“Big Learning”). With active research spanning machine learning, databases, parallel and distributed systems, parallel architectures, programming languages and abstractions, and even the sciences, Big Learning has attracted intense interest. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):

Big Data: Methods for managing large, unstructured, and/or streaming data; cleaning, visualization, interactive platforms for data understanding and interpretation; sketching and summarization techniques; sources of large datasets.

Models & Algorithms: Machine learning algorithms for parallel, distributed, GPGPUs, or other novel architectures; theoretical analysis; distributed online algorithms; implementation and experimental evaluation; methods for distributed fault tolerance.

Applications of Big Learning: Practical application studies and challenges of real-world system building; insights on end-users, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced).

Tools, Software & Systems: Languages and libraries for large-scale parallel or distributed learning which leverage cloud computing, scalable storage (e.g. RDBMs, NoSQL, graph databases), and/or specialized hardware.

Submissions should be written as extended abstracts, no longer than 4 pages (excluding references) in the NIPS latex style. Relevant work previously presented in non-machine-learning conferences is strongly encouraged, though submitters should note this in their submission.

Submission Deadline: October 17th, 2012.
Please refer to the website for detailed submission instructions:


I got this link from Igor Carron, the famous compressed sensing and matrix factorization blogger.

In one of the threads there was a discussion about recommender capabilities. Since we were looking at Arxaliv as a model (it is a Reddit clone), I went to the Reddit discussion of the development of that open source platform, and found that Reddit are actually looking for a recommender system and have a nice dataset:
There are 23,091,688 votes from 43,976 users over 3,436,063 links in 11,675 reddits. (Interestingly these ~44k users represent almost 17% of our total votes). The dump is 2.2gb uncompressed, 375mb in bz2. A reddit is a category; a link is a subject (in Arxaliv it would be a paper), so the matrix (43,976 x 3,436,063) is pretty sparsely filled (1.5e-5). Some SVD has been tried, but I am sure they haven't looked at low rank solvers. Since Reddit is such a massive platform, if your algorithm provides good results, it will get known beyond your expectations.

Sunday, September 2, 2012

Installing GraphLab 2.1 on Mountain Lion (MAC OS X 10.8)

Here is a quick explanation on how to install GraphLab 2.1 on Mountain Lion.

1) Install XCODE 4.4  
2) After XCODE is installed, go to XCode-> Preferences->Downloads 
and install "command line tools".

3) Install mercurial .
4) Go to Graphlab download page and follow the link to the mercurial code page.
    Open a new terminal and use the command to checkout the code:
    hg clone https://...
5) cd graphlabapi

6) For compiling the collaborative filtering library:
    cd release/toolkits/collaborative_filtering/
    make -j4

Harry Potter Effect on Recommendations

I just learned a new thing from my mega collaborator Justin Yan. The Harry Potter effect is a term for the phenomenon where a single item is overwhelmingly popular (everyone likes Harry Potter).

When computing recommendations (for example movies) we can compare movie similarity
to recommend to the user similar movies to the one she saw. The problem, when we have an item which is liked by everyone, it "covers" any other subset of the data, and thus it will have the greatest similarity to all other items. For example, the movie Tarzan was watched 10 times. Harry Potter was watched 1,000,000 times. Every single time a viewer watched Tarzan, she watched also Harry Potter! So Harry Potter is the most similar movie to Tarzan. And to all the other movies as well...

To prevent this Harry Potter effect, we normalize by the total number of ratings in the data. Thus, if the overlap between Tarzan and Harry Potter is 10 watches, we divide it by the 1,000,000 occurrences of Harry Potter ratings, diminishing the "Harry Potter" effect.
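A toy numeric illustration of the effect and the fix (hypothetical viewing data; the normalization used here is cosine similarity, one common choice):

```python
import math

# Toy illustration of the Harry Potter effect. Raw overlap makes the
# blockbuster everything's nearest neighbor; normalizing by item popularity
# (cosine similarity) fixes it. Viewer sets below are made up.

watches = {
    "Harry Potter": set(range(1000)),   # watched by everyone
    "Tarzan":       set(range(10)),     # 10 viewers, all of whom also saw HP
    "Jungle Book":  set(range(5, 13)),  # shares viewers 5-9 with Tarzan
}

def overlap(a, b):
    return len(watches[a] & watches[b])

def cosine(a, b):
    return overlap(a, b) / math.sqrt(len(watches[a]) * len(watches[b]))

# By raw overlap, Harry Potter looks closest to Tarzan (10 shared viewers vs 5)...
print(overlap("Tarzan", "Harry Potter"), overlap("Tarzan", "Jungle Book"))
# ...but after normalizing by popularity, Jungle Book wins.
print(cosine("Tarzan", "Harry Potter"), cosine("Tarzan", "Jungle Book"))
```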