Thursday, September 13, 2012

GraphChi parsers toolkit

Consecutive IDS parser

If you like to use GraphChi collaborative filtering library, you first need to parse your data into consecutive user integer ids and consecutive item integer ids.
For example, assume you have a user movie rating database with the following format:
Johny Walker,The Matrix (Director's Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1
Namely, each row contains one ratings with the format
<user name>,<item name>,<rating>

And you like to convert it to sparse matrix market format:
1 1  5
2 2  3
3 3  1

Namely user 1 rated item 1 with rating of 5, etc.

Additionally, you can also convert files of the format:
12032 12304-0323-X 3
Where for example 12032 is the first user ID, 12304-0323-X is the ISBN of the rated book and 3 is the rating value.

You can use the consecutive_matrix_market parser to create the appropriate format from your data files. 

The input to the parser is either CSV file, TSV file or a matrix market input file with non consecutive integer ids. The user and items can be either strings or integers.

There are several outputs to the consecutive IDS parser:
1) filename.out - a rating file with the format [user] [ movie] [rating] where both the user ids and item ids are consecutive integers. In our example

1 1  5
2 2  3
3 3  1

2) Mapping between user names/ids to consecutive integers. In our example:

Abba Babba 2
Bondy Jr. 3
Johny Walker 1
3) Mapping between movie names/ids to consecutive integers. In our example:

James Bond VII 2
The Matrix (Director's Cut) 1
Titanic 3

4+5) The same maps but in reverse, namely mapping back between unsigned integers to string ids.
6) Matrix market header file - in our examples there are 3 users, 3 movies and 3 ratings:

%%MatrixMarket matrix coordinate integer general
3 3 3

To run the consecutive IDS parser you should prepare a file with the input file name within it.
For example, the file name "movies" contains:

Johny Walker,The Matrix (Director's Cut),5
Abba Babba,James Bond VII,3
Bondy Jr.,Titanic,1

Finally, you run the parser:
./toolkits/parsers/consecutive_matrix_market --training=movies --csv=1

The supported command line flags are:
--csv=1 - csv format
--tsv=1 - tsv format
-> otherwise - sparse separated format

--file_list=XXX - input file which contains list of files to  parse (each file in a new line) [optional]

--single_domain - if set to 1, both from and to ids are from the same set of ids, if set to 0, from ids get a different range, and to ids get a different range. This is useful for bipartite graphs (Default is 0);

--binary - if set to 1, edges have no weight. If set to 0, edge weight is expected. (Default is 0).

--debug=1 - more verbose output (recommended). Default is 0.

Adj Parser

Adj format is explained here:

The Adjacency list file format stores on each line, a source vertex, followed by a list of all target vertices: each line has the following format:
[vertex ID]  [number of target vertices] [target ID 1] [target ID 2] [target ID 3] ...
This format is more compact to store, and in practice will partition better in the distributed setting (since edges have a natural grouping). Furthermore, this format is capable of storing disconnected vertices.
For instance, the following graph:
graph_format_example.gif
can be stored as:
1 2 2 5
7 2 7 1
5 1 7


The parser converts the Adj file into GraphChi CF compatible format.
For example, assume we have a file with 2 lines:

2884424247 11 1210682095 1789803763 1879013170 1910216645 2014570227 2109318072 2268277425 2289674265 2340794623 2513611825 2770280793
2885009247 16 1420862042 1641392180 1642909335 1775498871 1781379945 1784537661 1846581167 1934183965 2011304655 2016713117 2017390697 2128869911 2132021133 2645747085 2684129850 2866009832
Namely, one node with 11 edges and another node with 16 edges.

Assume the file name is b. We create a file named a with the first line contains "b" (without the qoutes!)
./toolkits/parsers/adj --file_list=a --single_domain=1
Starting program: /home/bickson/graphchi/toolkits/parsers/snap --file_list=a --single_domain=1
[Thread debugging using libthread_db enabled]
WARNING:  snap.cpp(main:204): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[file_list] => [a]
[single_domain] => [1]
INFO:     snap.cpp(parse:196): Finished parsing total of 3 lines in file b
total map size: 29
Finished in 0.000591
INFO:     snap.cpp(save_map_to_text_file:74): Wrote a total of 29 map entries to text file: auser.map.text
INFO:     snap.cpp(save_map_to_text_file:86): Wrote a total of 29 map entries to text file: auser.reverse.map.text
INFO:     snap.cpp(main:255): Writing matrix market header into file: matrix_market.info


Now we look at the output:
bickson@thrust:~/graphchi$ cat b.out 
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
13 14
13 15
13 16
13 17
13 18
13 19
13 20
13 21
13 22
13 23
13 24
13 25
13 26
13 27
13 28
13 29

We also got a matrix market header that can be used in graphchi.
bickson@thrust:~/graphchi$ cat matrix_market.info 
%%MatrixMarket matrix coordinate integer general
29 29 27

Additionally map file between the old ids and new ids is stored in the directory.

LDA Parser

To the request of Baoqiang Cao I have started a parsers toolkits in GraphChi to be used for
preparing data to be used in GraphLab/ Graphchi. The parsers should be used as template which can be easy customized to user specific needs.
LDA parser code is found here.

Example input file:

I was not a long long ago here
because i am not so damn smart.
as you may have thought

The assumption is that each line holds words from a certain document.  We would like to assign the strings unique ids, and count how many words appear in each document. 

The input to GraphLab's LDA is in the format
[docid] [wordid] [word count]

Example run:
1) Create a file "stamfile" with the example input above.
2) Create a file "a" which has the file name "stamfile" in its first line.
>>head d
stamfile
3) Run the parser:
bickson@thrust:~/graphchi$ ./toolkits/parsers/texttokens --file_list=a --min_threshold=1 --max_threshold=10
WARNING:  texttokens.cpp(main:146): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[file_list] => [a]
INFO:     texttokens.cpp(parse:138): Finished parsing total of 4 lines in file stamfile
total map size: 16
Finished in 0.024819
INFO:     texttokens.cpp(save_map_to_text_file:56): Wrote a total of 16 map entries to text file: amap.text
INFO:     texttokens.cpp(save_map_to_text_file:68): Wrote a total of 16 map entries to text file: areverse.map.text

Explanation: the min_threshold filters output to words that appear at least min_threshold times, and the same with max_threshold.

The output of the parser is a file named stamfile.out
bickson@thrust:~/graphchi$ cat stamfile.out 
2 1 1
2 2 1
2 3 1
2 4 2
2 5 1
2 6 1
3 11 1
3 3 1
3 7 1
3 8 1
3 9 1
3 10 1
4 12 1
4 13 1
4 14 1
4 15 1
4 16 1

And the mapping files:
bickson@thrust:~/graphchi$ cat amap.text
may 14
long 4
am 8
because 7
you 13
as 12
so 9
here 6
have 15
not 3
i 1
smart 11
thought 16
damn 10
was 2
ago 5

bickson@thrust:~/graphchi$ cat areverse.map.text 
1 i
2 was
3 not
4 long
5 ago
6 here
7 because
8 am
9 so
10 damn
11 smart
12 as
13 you
14 may
15 have
16 thought


Twitter parser

A second parser, which is slightly more fancy, go over user twitter tweets in the following format:


/*
 * Twitter input format is:
 *
 * T  2009-06-11 16:56:42
 * U  http://twitter.com/tiffnic85
 * W  Bus is pulling out now. We gotta be in LA by 8 to check into the Paragon.
 *
 * T  2009-06-11 16:56:42
 * U  http://twitter.com/xlamp
 * W  灰を灰皿に落とそうとすると高確率でヘッドセットの線を根性焼きする形になるんだが
 *
 * T  2009-06-11 16:56:43
 * U  http://twitter.com/danilaselva
 * W  @carolinesweatt There are no orphans...of God! :) Miss your face!
 *
 */
And extracts graph links between the retweets.

Aggregate parser

This parser computes the aggregate of the requested column.
The input is a sorted graph:
1 2 4 
1 2 4 
2 3 3 
3 4 2 
3 4 2 

And the output is the aggregate of the requested column (column 3 in the example):

1 2 8
2 3 3 
3 4 2 
3 4 4

Operating the parsers

1) Install graphchi as explained in steps 1+2 here.
2) Compile with "make parsers"
3) Run using ./toolkits/parsers/texttoken --file_list=files


Cascading several parsers together

I got this from Stefan Weigert:
I would like to use GraphLab for a top-k algorithm. There is a communication-log which i want to process as input-file. Assume, the log contains the following entries :

YVjAeZQjnVA IfrTTVlatui 050803 156
YVjAeZQjnVA IfrTTVlatui 050803 12
GNgrmichxmG GNgriWokEhN 050803 143
YnRdCKZkLao MHexzaXWCPL 050803 0
RGNReqpKcZw RGNRSTDdqew 050803 0
LPHSeuGhYkN ZFwbovKzAxY 050803 1
sijmyRRfkwl XtqJaHYFEPqbZqNGPCr 050803 68
RGNReqpKcZw RGNRSTDdqew 050805 6
RGNReqpKcZw RGNRSTDdqew 050805 6
sijmyRRfkwl XtqJaHYFEPqbZqNGPCr 050805 12

Where each line has the following format:
[ caller id] [ receiver id ] [date ] [call duration]


What i wanted to do is
1) load the first line, add an edge YVjAeZQjnVA-> IfrTTVlatui with 156 as attribute
2) load the second line, see that YVjAeZQjnVA-> IfrTTVlatui exists already and only change the edge-attrib to 168 (168 = 156+12)

Proposed solution:

1) Run consecutive ids parser to convert the strings to consecutive ids.
    a) Assume the input file is named "test". Prepare a list file named "files" with the string "test" in its first line.
   b) run consecutive parser with:

./toolkits/parsers/consecutive_matrix_market --file_list=files
WARNING:  consecutive_matrix_market.cpp(main:177): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[file_list] => [a]
INFO:     consecutive_matrix_market.cpp(parse:169): Finished parsing total of 10 lines in file test
total map size: 6
Finished in 0.000117
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:69): Wrote a total of 6 map entries to text file: auser.map.text
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:81): Wrote a total of 6 map entries to text file: auser.reverse.map.text
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:69): Wrote a total of 6 map entries to text file: amovie.map.text
INFO:     consecutive_matrix_market.cpp(save_map_to_text_file:81): Wrote a total of 6 map entries to text file: amovie.reverse.map.text
INFO:     consecutive_matrix_market.cpp(main:225): Writing matrix market header into file: matrix_market.info
   c) The output is saved into the file: test.out
1 1 050803 156
1 1 050803 12
2 2 050803 143
3 3 050803 0
4 4 050803 0
5 5 050803 1
6 6 050803 68
4 4 050805 6
4 4 050805 6
6 6 050805 12

2) Sort the output file:
sort -k 1,1 -k 2,2 -g test.out > test.sorted
The sorted file is:
1 1 050803 156
1 1 050803 12
2 2 050803 143
3 3 050803 0
4 4 050803 0
...

3) Run the aggregator parser:
   a) edit the file "files" to contain the string test.sorted in the first line.
   b) Run the aggregator parser:
   bickson@thrust:~/graphchi$ ./toolkits/parsers/aggregator --file_list=a --col=4
WARNING:  aggregator.cpp(main:151): GraphChi parsers library is written by Danny Bickson (c). Send any  comments or bug reports to danny.bickson@gmail.com 
[file_list] => [a]
[col] => [4]
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 6.4e-05 edges: 1
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 8.6e-05 edges: 2
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 9.6e-05 edges: 3
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000103 edges: 4
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000109 edges: 5
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000114 edges: 6
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000119 edges: 7
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000124 edges: 8
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000129 edges: 9
INFO:     aggregator.cpp(parse:137): Hash map size: 0 at time: 0.000134 edges: 10
INFO:     aggregator.cpp(parse:143): Finished parsing total of 11 lines in file test.sorted
total map size: 0
Finished in 0.000559
c) Now the output is:
1 1 168
2 2 143
3 3 0
4 4 12
5 5 1
6 6 80

We are done!


ips2ids parser

This parser converts graph in the string ip address format (A.B.C.D) into consecutive id format.
Example input file:
192.179.12.12 13.142.5.5 12
167.12.48.5 132.1.1.4 6
...



counter 

This parser takes a text file with unsigned int in each line and counts how many instances of each unsigned int are there.

For example, an input file named "in":

  1
  2
  3
  3
  3
  3
  3
  3
  3
  2

  The output of "./count in"
  1 1
 2 2
 3 8


Acknowledgements/ hall of fame

5 comments:

  1. Hi,

    I have found that consecutive_market_matrix is inside parser folder and not in (./toolkits/collaborative_fitering/consecutive_matrix_market --file_list=list --csv=1)...kindly update the documentation.

    second, i am experancing err while compiling consecutive_matrix_market.

    mburhan@mburhan-Vostro-3460:~/graphchi$ make parsers
    cd toolkits/parsers/; make
    make[1]: Entering directory `/home/mburhan/graphchi/toolkits/parsers'
    g++ -g -O3 -I/usr/local/include/ -I../../src/ -I. -lz -fopenmp -Wall -Wno-strict-aliasing adj.cpp -o adj
    In file included from ../../src/preprocessing/conversions.hpp:38:0,
    from ../../src/graphchi_basic_includes.hpp:54,
    from adj.cpp:31:
    ../../src/preprocessing/sharder.hpp: In member function ‘void graphchi::sharder::write_shards() [with EdgeDataType = graphchi::dummy]’:
    ../../src/preprocessing/sharder.hpp:349:13: instantiated from ‘int graphchi::sharder::execute_sharding(std::string) [with EdgeDataType = graphchi::dummy, std::string = std::basic_string]’
    ../../src/preprocessing/conversions.hpp:556:65: instantiated from here
    ../../src/preprocessing/sharder.hpp:669:23: warning: variable ‘lastdst’ set but not used [-Wunused-but-set-variable]
    /tmp/ccKFFWiY.o: In function `unsigned long write_compressed(int, char*, unsigned long)':
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:157: undefined reference to `deflateInit_'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:174: undefined reference to `deflate'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:188: undefined reference to `deflateEnd'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:178: undefined reference to `deflateEnd'
    /tmp/ccKFFWiY.o: In function `void read_compressed(int, char*, unsigned long)':
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:213: undefined reference to `inflateInit_'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:234: undefined reference to `inflate'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:251: undefined reference to `inflateEnd'
    /tmp/ccKFFWiY.o: In function `unsigned long write_compressed(int, unsigned char*, unsigned long)':
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:157: undefined reference to `deflateInit_'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:174: undefined reference to `deflate'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:188: undefined reference to `deflateEnd'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:178: undefined reference to `deflateEnd'
    /tmp/ccKFFWiY.o: In function `void read_compressed(int, unsigned char*, unsigned long)':
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:213: undefined reference to `inflateInit_'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:234: undefined reference to `inflate'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:251: undefined reference to `inflateEnd'
    /tmp/ccKFFWiY.o: In function `unsigned long write_compressed >(int, graphchi::edge_with_value*, unsigned long)':
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:157: undefined reference to `deflateInit_'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:174: undefined reference to `deflate'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:188: undefined reference to `deflateEnd'
    /home/mburhan/graphchi/toolkits/parsers/../../src/util/ioutil.hpp:178: undefined reference to `deflateEnd'
    collect2: ld returned 1 exit status
    make[1]: *** [adj] Error 1
    make[1]: Leaving directory `/home/mburhan/graphchi/toolkits/parsers'
    make: *** [parsers] Error 2


    ReplyDelete
    Replies
    1. HI Burhan!
      Thanks for your note. I assume you are using Ubuntu? We had a recent bug resulting to zlib usage. Anyway I fixed it - please retake from mercurial (using: "hg pull; hg update; make clean; make parsers") and let me know if it works!

      Delete
    2. Dear Danny,

      Thank you, it is compiled successfully.

      Now I am experiancing err while parsing movies file. The format is below as per the documentation.

      ==>movies file contains the following content with 'movies' as first line

      movies
      Johny Walker,The Matrix (Director's Cut),5
      Abba Babba,James Bond VII,3
      Bondy Jr.,Titanic,1

      ==>command

      mburhan@mburhan-Vostro-3460:~/graphchi/toolkits/parsers$ ./consecutive_matrix_market --file_list=movies --csv=1
      WARNING: consecutive_matrix_market.cpp(main:188): GraphChi parsers library is written by Danny Bickson (c). Send any comments or bug reports to danny.bickson@gmail.com
      [file_list] => [movies]
      [csv] => [1]
      ERROR: consecutive_matrix_market.cpp(parse:152): Error when parsing file: movies:1[movies]
      Failed to open file: Johny

      Delete
    3. You need to create a file named "list" for example, with the file name :
      movies
      in the first line. Then you run with --file_list=list
      and not --file_list=movies

      Delete
    4. great! I got it now...
      I made two files, i.e 'list' and 'movies'
      the content of file 'list' contains the name of second file which is 'movies'. and then the content of 'movies' file contain values (customer_id, item_id, rating).

      it worked perfectly fine.

      Thank you.
      Burhan

      Delete