Tuesday, December 11, 2012

Collaborative Filtering - 3rd Generation [or] winning the kdd cup in 5 minutes!

NOTE: This blog post is two years old. We have reimplemented this code as part of Graphlab Create. The implementation in GraphLab Create is preferred since:
1) No input format conversions are needed (like matrix market header setup)
2) No parameter tuning like step sizes and regularization are needed
3) No complicated command line arguments
4) The new implementation is more accurate, especially regarding the validation dataset. 
Anyone who wants to try it out should email me, I will send you the exact same code in python.


After spending a few years writing collaborative filtering software with thousands of installations, and after talking to tens of companies and participating in KDD CUP twice,  I have started to develop some next generation collaborative filtering software. The software is very experimental at this point and I am looking for the help of my readers - universities and companies who would like to try it out.
[NOTE: I HAVE ADDED SOME UPDATES BELOW ON THURSDAY DEC 13]

The problem:

Most collaborative filtering methods (like ALS, SGD, bias-SGD, NMF etc.) use the rating values for computing matrix factorization. A few "fancier" methods (like tensor-ALS, time-SVD++ etc. ) utilize also the temporal information to improve the quality of predictions. So basically we are limited to 2 or 3 dimensional factorization. Typically the utilized data is of the type:
[ user ] [ item ] [ rating ] 
or
[ user ] [ item ] [ time ] [ rating ] 

I am often asked, how to approach problems when you have data of the type:

[ user ] [ item ] [ item category] [ purchase amount ] [ quantity ] [ user age ] [ zip code ] [ time ] [ date ] ... [ user rating ]

In other words, how do we utilize additional information we have about user features, item features, or even more fancier feature like user friends etc. This problem is often encountered in practice and in many cases, papers are written about it by doing specific constructions. See for example Koenigstein's paper. However, in practice, most users do not like to break their heads and invent novel algorithms but want to have a readily accessible method that can take more features into account and without much fine tuning.

The solution:

Following the great success of libFM, I thought about implementing a more general SGD method in GraphChi that case take a list of features into account.

A new SGD based algorithm is developed with the following
1) Support for string features (John Smith bought the Matrix)
2) Support for dynamic selection of features on runtime.
3) Support of multiple file formats with column permutation.
4) Support for an unlimited number of features
5) Support for multiple ratings of the same item.

Working example - KDD CUP 2012 - track1

To give some concrete example, I will use KDD CUP 2012 track1 data which will demonstrate how easy to setup and try the new method.

Preliminaries:
0) Download track 1 data from here. Extract the zip file.
1) Download and install GraphChi using steps 1-3.

2a) In the root graphchi folder, Create a file named rec_log_train.txt:info with the following lines:

%%MatrixMarket matrix coordinate real general
2500000 2500000 73209277


2b) link the file track1/rec_log_train.txt into the root graphchi folder:
cd graphchi
ln -s ../track1/rec_log_train.txt .

Let's look at the input file format:

<49|0>bickson@bigbro6:~/graphchi$ head rec_log_train.txt
2088948 1760350 -1 1318348785
2088948 1774722 -1 1318348785
2088948 786313 -1 1318348785
601635 1775029 -1 1318348785
601635 1902321 -1 1318348785
The input is of the format 
[user] [ add ] [ click ] [ timestamp ]

Where click is either -1 (not clicked) or 1 (clicked).

First step: regular matrix factorization

Now let's run a quick matrix factorization using user, item and rating:
 ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2

Explanation: --training is the input file. --val_pos=2 means that the rating is in column 2, --rehash=1 means we treat all fields as strings (and thus support string values), --limit_rating means we handle only the first million ratings (to speed up the demo), --max_iter is the number of SGD iterations, --minval and --maxval are the allowed rating range, and --quiet less verbose output. --calc_error displays the classification error (how many predictions were wrong).--file_columns=4 says that there are 4 columns in the input file.

And here is the output we get:



WARNING:  common.hpp(print_copyright:204): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com
[training] => [rec_log_train.txt]
[limit_rating] => [1000000]
[max_iter] => [100]
[gensgd_mult_dec] => [0.999999]
[minval] => [-1]
[maxval] => [1]
[quiet] => [1]
[calc_error] => [1]
[file_columns] => [4]
[gensgd_rate0] => [0.1]
[gensgd_rate1] => [0.1]
[gensgd_rate2] => [0.1]
[gensgd_regw] => [0.1]
[gensgd_reg0] => [0.1]
[from_pos] => [0]
[to_pos] => [1]
[val_pos] => [2]

 === REPORT FOR sharder() ===
[Timings]
edata_flush: 1.00698s (count: 265, min: 0.000625s, max: 0.005065, avg: 0.00379992s)
execute_sharding: 2.3855 s
finish_shard.sort: 0.648602s (count: 4, min: 0.156317s, max: 0.166634, avg: 0.162151s)
preprocessing: 1.72782 s
shard_final: 1.78102s (count: 4, min: 0.432858s, max: 0.454368, avg: 0.445255s)
[Other]
app: sharder
   31.7185) Iteration:   0 Training RMSE:   0.526537 Train err:  0.0010427
Step size 1 0.000387365  Step size 2 0.000780633  Step size 3 8.82609e-06
...
   295.691) Iteration:  99 Training RMSE:   0.206428 Train err: 0.000239218

We got a training RMSE error of 0.20, but only a training error of 0.02% (namely 99.98% of the predictions are correct)

Second step: temporal matrix factorization

Now let's add the time bins (3rd column) into the computation as feature and run again. This is done using the --features=3 command line flag:


./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt  --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --gensgd_rate0=0.1 --gensgd_rate1=0.1 --gensgd_rate2=0.1 --gensgd_regw=0.1 --gensgd_reg0=0.1 --from_pos=0 --to_pos=1 --val_pos=2 --features=3 --gensgd_rate3=0.1 --gensgd_rate4=0.1
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   3.17175) Iteration:   0 Training RMSE:   0.522901 Train err:   0.033788
...
    284.147) Iteration:  99 Training RMSE:  0.0275943 Train err:          0

By adding the time bins into consideration, we get an improvement from RMSE 0.206 to 0.027 !!
Furthermore, the classification error is down to zero. 

Third step: let's throw in some user features!

Besides of add rating data, we have some additional information about the users. The file user_profile.txt holds some properties of each user. The file has the following format:
100044 1899 1 5 831;55;198;8;450;7;39;5;111
100054 1987 2 6 0
100065 1989 1 57 0
100080 1986 1 31 113;41;44;48;91;96;42;79;92;35
100086 1986 1 129 0
100097 1981 1 75 0
100100 1984 1 47 71;51

The file has the following format:
[user ] [ year of birth ] [ gender ] [ number of tweets ] [ tag ids (area of interest) ]

Adding user features is simply done by the flag --user_file=user_profile.txt

./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1  --calc_error=1 --file_columns=4 --features=3 --last_item=1 --user_file=user_profile.txt
WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.02809) Iteration:   0 Training RMSE:          0 Train err:          0
   2.90718) Iteration:   1 Training RMSE:   0.511614 Train err:   0.022662
   3.74655) Iteration:   2 Training RMSE:    0.49371 Train err:   0.017136
   4.55983) Iteration:   3 Training RMSE:   0.479225 Train err:   0.015074
   5.40781) Iteration:   4 Training RMSE:   0.465404 Train err:   0.016538
   6.27764) Iteration:   5 Training RMSE:   0.451063 Train err:   0.015657
...
   77.5867) Iteration:  96 Training RMSE:  0.0177382 Train err:          0
   78.3384) Iteration:  97 Training RMSE:  0.0176325 Train err:          0
   79.0683) Iteration:  98 Training RMSE:  0.0174947 Train err:          0
   79.7872) Iteration:  99 Training RMSE:  0.0174152 Train err:          0

Overall we got another improvement from 0.018 to 0.0174

Step four: throw in some item features

In the KDD cup data, we are also given some item features, in the file item.txt

2335869 8.1.4.2 412042;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974;85658;174033;974;9525;72246;39928;8895;30066;2245;1670;85658;174033;6977;6183;974
1774844 1.8.3.6 31449;517124;45008;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268;2796;79868;45008;202761;2796;101376;144894;31449;327552;133996;17409;2796;4986;2887;31449;6183;2796;79868;45008;13157;16541;2796;17027;2796;2896;4109;501517;2487;2184;9089;17979;9268

The format is:
[add id] [catergory] [ list of keywords ]


Let's throw in some item information into the algorithm. This is done using the --item_file parameter.



bickson@thrust:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=0 --features=3 --last_item=1   --quiet=1 --user_file=user_profile.txt --item_file=item.txt --gensgd_rate5=1e-5 --calc_error=1 --file_columns=4

WARNING:  common.hpp(print_copyright:95): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
INFO:     gensgd.cpp(main:1140): Total selected features: 1 : 
   2.23951) Iteration:   0 Training RMSE:          0 Train err:          0
   4.95858) Iteration:   1 Training RMSE:   0.527203 Train err:   0.022205
   7.54827) Iteration:   2 Training RMSE:   0.499881 Train err:   0.022271
    10.026) Iteration:   3 Training RMSE:   0.476596 Train err:   0.024138
   12.4976) Iteration:   4 Training RMSE:   0.454496 Train err:   0.016523
   14.9459) Iteration:   5 Training RMSE:   0.431336 Train err:   0.016406
...
    217.96) Iteration:  96 Training RMSE:  0.0127242 Train err:          0
   220.116) Iteration:  97 Training RMSE:  0.0126185 Train err:          0
   222.317) Iteration:  98 Training RMSE:  0.0125111 Train err:          0
   224.559) Iteration:  99 Training RMSE:  0.0123526 Train err:          0



We got some significant RMSE improvement - from 0.017 to 0.012.

Thursday, Dec 13 - An update

I am getting a lot of readers inputs about this blog post, which is excellent!

One question I got from Xavier Amatriain, manager of recommendations @ Netflix, is why do I compute training error and not test error. Xavier is absolutely right, I was quite excited about the results so I wanted to share them before I even had time to compute the test error. Anyway I promise to do so in a couple of days. But I am quite sure that the model is quite accurate!

I got some interesting inputs from Tianqi Chen, author of SVDFeature software:
I think one important thing that we may want to add is the support of classification loss( which is extremely easy for SGD ). Since now days RMSE optimization seems get a bit out of fashioned and most data are click-through data and the optimization target is ranking instead of RMSE. I think the feature selection part is quite interesting. Since adding junk feature in those feature-based factorization model will almost hamper the performance. However, directly replacing L1 constraint on weight will work worse than L2 regularization, so I am curious what trick you used :-)

I also got comments from my golden collaborator Justin Yan:
1. for SGD-FM, it is hard to turn the parameters like learning rate and MCMC based method is slow.
2. Recently I find another great model- online bayisian probit regression (adpredictor) which bing has used in their CTR prediction. this model is a online learning model it is very fast,and the result is better than Logistic regression, so I am thinking about borrowing some ideas from this model to improve LibFM to a online learning model.

The last kind of feedback I am getting is from companies how claim to already solved this problem.. I think that if the problem was already completely solved, I was not getting so much feedback about it.
What do you think?

Next: next generation cf - part 2 - trying the software on airline on time data.

9 comments:

  1. The feature selection is cool and seems there has been little study on this (However, due to the non-convex nature of the problem, direct Lasso may not work well). A general solver is quite appealing and saves time for writing specific solver, but we still need lots of effort on smart feature engineering. It is possible however, to pre-code some of the feature engineering tricks.

    Lastly, I would like to mention SVDFeature: http://svdfeature.apexlab.org, which also serves on the general purpose and runs fast.

    ReplyDelete
    Replies
    1. Thanks Tianqi for your comment - this is still work in progress so I would love to get any recommended and useful engineered features to add, or any other tricks you may have!

      Delete
  2. Hi Danny,
    I am trying to execute the above example using the following command:

    manukr007@manukr007-VirtualBox:~/graphchi$ ./toolkits/collaborative_filtering/gensgd --training=rec_log_train.txt --val_pos=2 --rehash=1 --limit_rating=1000000 --max_iter=100 --gensgd_mult_dec=0.999999 --minval=-1 --maxval=1 --quiet=1 --calc_error=1 --file_columns=4

    the process is terminated while generating report for sharder.. The output generated is as follows:
    WARNING: common.hpp(print_copyright:144): GraphChi Collaborative filtering library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
    [training] => [rec_log_train.txt]
    [val_pos] => [2]
    [rehash] => [1]
    [limit_rating] => [1000000]
    [max_iter] => [100]
    [gensgd_mult_dec] => [0.999999]
    [minval] => [-1]
    [maxval] => [1]
    [quiet] => [1]
    [calc_error] => [1]
    [file_columns] => [4]

    === REPORT FOR sharder() ===
    [Timings]
    edata_flush: 0.754896s (count: 39, min: 0.002352s, max: 0.040892, avg: 0.0193563s)
    execute_sharding: 3.19784 s
    preprocessing: 2.13674 s
    read_shovel: 0.280509s (count: 3, min: 0.092642s, max: 0.094204, avg: 0.093503s)
    shard_final: 1.93983s (count: 3, min: 0.619938s, max: 0.671011, avg: 0.646611s)
    shovel_flush: 0.815377s (count: 3, min: 0.242643s, max: 0.324823, avg: 0.271792s)
    shovel_readcompressed: 0.278907s (count: 3, min: 0.092167s, max: 0.093669, avg: 0.092969s)
    [Other]
    app: sharder
    Killed

    Thanks,
    Manu

    ReplyDelete
  3. I suspect you run out of memory. Can you send me the matrix market header of your input file?
    thanks

    ReplyDelete
  4. Hi Danny,

    I increased the memory of my VM from 1 GB TO 2GB and it worked while performing the step 1 and 2 as described above.

    The matrix market header has the following:
    %%MatrixMarket matrix coordinate real general
    2500000 2500000 73209277

    Two of the output files created are "rec_log_train.txt_U_bias.mm" and "rec_log_train.txt_U.mm". Can you please guide as to what must be the next step to generate the recommendations for the users.

    I am now trying to run the "track1" dataset with the user features as described in step 3 and it execution stops with the following during generating the report for sharder:

    app: sharder
    terminate called after throwing an instance of 'std::bad_alloc'
    what(): std::bad_alloc
    Aborted (core dumped)

    On executing along with the item features also as described in step 4, the executions stops and following output is generated:

    [training] => [rec_log_train.txt]
    [val_pos] => [2]
    [rehash] => [1]
    [limit_rating] => [1000000]
    [max_iter] => [100]
    [gensgd_mult_dec] => [0.999999]
    [minval] => [-1]
    [maxval] => [1]
    [quiet] => [0]
    [features] => [3]
    [last_item] => [1]
    [quiet] => [1]
    [item_file] => [item.txt]
    [gensgd_rate5] => [1e-5]
    [calc_error] => [1]
    gensgd: ../../src/util/cmdopts.hpp:93: int graphchi::get_config_option_int(const char*): Assertion `false' failed.
    ERROR: could not find option file_columns from config.Aborted (core dumped)

    I have the user and item features file in the graphchi folder.

    Thanks,
    Manu

    ReplyDelete
    Replies
    1. You should add --file_columns=4. I have fixed documentation.

      Regarding bad_alloc - again this is out of memory error. On my machine it consumes 1.7GB virtual mem. but maybe your machine is limited and do not give the software the full requested memory.

      Delete
  5. We offer Loan Amount of : 5k-\$5m I want to fully assure you that you
    will get the funds if only you are very serious and trust worthy
    because i like serious.If interested email us samuelloaninvestment@hotmail.com

    Name:
    Loan amount:
    Duration:
    Country:
    State:
    Sex:
    Phone:

    ReplyDelete
  6. Getting the following exception while running gensgd

    FATAL: gensgd.cpp(gensgd_predict:909): Got into numerical problems. Try to decrease step size

    Running the command in the following way

    toolkits/collaborative_filtering/gensgd --training=useractivitydata-sorted --rehash=1 --max_iter=10 --quiet=1 --file_columns=3 --calc_error=1 --val_pos=2 --gensgd_mult_dec=0.9 --minval=1 --maxval=539

    ReplyDelete
    Replies
    1. Hi Venkata,
      SGD algorithms have to fine tune step size. I suggest setting the following command line parameters to zero:
      gensgd_rate1, gensgd_rate2, gensgd_rate3, gensgd_rate4, gensgd_rate5 and then slowly increasing those step sizes for example 1e-8, 1e-7, 1e-6.. until you get the numerical problems error. It will also help, if you could normalize your values, instead of 1 -> 539 if you can normalize them from 0 to 1 (or 1 to 2). Since the magnitude of the values affect step sizes as well.

      Delete