Monday, May 14, 2012

ELF (ensemble learning framework)

ELF is an ensemble learning package recommended by JustinYan. It lets you combine a few rating predictions into a single, higher-quality prediction. It was written by Michael Jahrer, one of the winners of the Netflix Prize. We used it for KDD CUP 2011.

Disclaimer: this software is very rough and not for the faint of heart. Installation is rather complicated, usage is rather complicated, and I have experienced many crashes. However, it is a very comprehensive attempt at creating a proper ensemble library.
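
To make the idea of combining predictions concrete, here is a minimal sketch of linear blending with a small ridge penalty. It is not ELF's code: the function names, the toy data, and the choice to regularize the constant term are all my own assumptions. It only illustrates the kind of computation a setting such as blendingAlgorithm=LinearRegression with blendingRegularization=1e-4 suggests.

```python
# Sketch only: ridge-regularized linear blending of several predictors.
# All names and numbers are illustrative assumptions, not taken from ELF.
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def blend_weights(predictions, targets, lam=1e-4):
    """Return weights [const, w_1, ..., w_k] from (X^T X + lam I) w = X^T y."""
    rows = [[1.0] + list(p) for p in zip(*predictions)]  # leading 1.0 = constant predictor
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def apply_blend(predictions, weights):
    """Combine the predictors sample by sample using the blending weights."""
    return [weights[0] + sum(w * p for w, p in zip(weights[1:], ps))
            for ps in zip(*predictions)]

def rmse(pred, targets):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, targets)) / len(targets))

# Toy data: two noisy predictors of the same +1/-1 targets.
targets = [1, -1, 1, 1, -1, -1]
p1 = [0.8, -0.6, 0.4, 1.2, -0.9, -0.2]
p2 = [0.5, -1.1, 0.9, 0.6, -0.4, -1.0]
weights = blend_weights([p1, p2], targets)
blended = apply_blend([p1, p2], weights)
print("rmse p1:", rmse(p1, targets))
print("rmse p2:", rmse(p2, targets))
print("rmse blend:", rmse(blended, targets))
```

On this toy data the blended prediction has a lower RMSE than either input, which is the whole point of an ensemble.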

Installation

Run Ubuntu 11.10 on an Intel platform (on Amazon EC2, use image ami-6743ae0e). Connect to the Ubuntu instance:
ssh -i graphlabkey.pem ubuntu@ec2-184-73-45-88.compute-1.amazonaws.com

sudo apt-get update
sudo apt-get install build-essential ia32-libs rpm gcc-multilib curl libcurl4-openssl-dev


Download the Intel C++ compiler from here:
You should select: Intel® C++ Composer XE 2011 for Linux (includes Intel® C++ Compiler, Intel® Integrated Performance Primitives, Intel® Math Kernel Library, Intel® Parallel Building Blocks).

Register using the form; you will get an email with the license number.

tar xvzf l_ccompxe_intel64_2011.10.319.tgz
cd l_ccompxe_intel64_2011.10.319
./install.sh
>>select option 2
Follow the instructions, using the default options, until completion. Then add the following lines to /etc/ld.so.conf:
/opt/intel/composer_xe_2011_sp1.10.319/compiler/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/mkl/lib/intel64/
/opt/intel/composer_xe_2011_sp1.10.319/ipp/lib/intel64/

Run the command:
sudo ldconfig

For bash, load the compiler environment:
source /opt/intel/composer_xe_2011_sp1.10.319/bin/compilervars.sh intel64

Edit Makefile to have:
INTEL_PATH = /opt/intel/composer_xe_2011_sp1.10.319/

And also:
INCLUDE = -I$(INTEL_PATH)/compiler/include -I$(INTEL_PATH)/mkl/include -I$(INTEL_PATH)/ipp/include
LIB = -L$(INTEL_PATH)/mkl/lib/intel64/ -L$(INTEL_PATH)/ipp/lib/intel64/ -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcore -lipps -openmp -lpthread

Now run make. If all went well, you will get an executable named ELF.

Common errors:
1) YahooFinance.h(6): catastrophic error: cannot open source file "curl/curl.h"
Solution: install libcurl4-openssl-dev as instructed above.
2) AlgorithmExploration.o InputFeatureSelector.o KernelRidgeRegression.o NeuralNetworkRBMauto.o nnrbm.o Autoencoder.o GBDT.o LogisticRegression.o YahooFinance.o -L/opt/intel/composer_xe_2011_sp1.10.319//mkl/lib/em64t -L/opt/intel/composer_xe_2011_sp1.10.319//ipp/em64t/sharedlib -lmkl_solver_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lippcoreem64t -lippsem64t -openmp -lpthread
ld: cannot find -lippcoreem64t
ld: cannot find -lippsem64t
make: *** [main] Error 1
Solution: edit the Makefile as instructed above.

Setting up the software

Prepare your training data in CSV format, where the last column is the target. Prepare your test data in CSV format as well. Create a directory named CSV, and inside it a file named Master.dsc with the following configuration:
dataset=CSV
isClassificationDataset=1

maxThreads=2
maxThreadsInCross=2
nCrossValidation=6
validationType=Retraining
positiveTarget=1.0
negativeTarget=-1.0
randomSeed=124391994
nMixDataset=20
nMixTrainList=100
standardDeviationMin=0.01
blendingRegularization=1e-4
blendingEnableCrossValidation=0
blendingAlgorithm=LinearRegression
enablePostNNBlending=0
enableCascadeLearning=0
enableGlobalMeanStdEstimate=0
enableSaveMemory=1
addOutputNoise=0
enablePostBlendClipping=0
enableFeatureSelection=0
featureSelectionWriteBinaryDataset=0
enableGlobalBlendingWeights=0
errorFunction=RMSE
disableWriteDscFile=0
enableStaticNormalization=0
#staticMeanNormalization=7.5
#staticStdNormalization=10
enableProbablisticNormalization=0
dimensionalityReduction=no
subsampleTrainSet=1.0
subsampleFeatures=1.0
globalTrainingLoops=1

[ALGORITHMS]
LinearModel_1.dsc
#KNearestNeighbor_1.dsc
#NeuralNetwork_1.dsc
#KernelRidgeRegression_1.dsc
#PolynomialRegression_1.dsc
#NeuralNetwork_1.dsc
#GBDT_1.dsc
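
As an aside, to get a feel for what nCrossValidation=6 means for the data: each of the six slots holds out roughly one sixth of the rows as a probe set and trains on the rest. The sketch below computes the split sizes for the 6162863-row training set used in the run shown later in this post. The rounding rule (round each fold boundary to the nearest integer) is my own assumption, chosen because it reproduces the TRAIN | PROBE table printed in that log; ELF's actual partitioning code may differ.

```python
# Sketch only: split sizes for k-fold cross validation.
# The boundary rounding rule is an assumption, not taken from ELF's source.

def fold_boundaries(n_samples, n_folds):
    """Boundary i is i * n_samples / n_folds rounded to nearest (half up)."""
    return [(2 * i * n_samples + n_folds) // (2 * n_folds)
            for i in range(n_folds + 1)]

def probe_sizes(n_samples, n_folds):
    """Number of held-out (probe) rows in each fold."""
    b = fold_boundaries(n_samples, n_folds)
    return [b[i + 1] - b[i] for i in range(n_folds)]

n_samples, n_folds = 6162863, 6
for slot, probe in enumerate(probe_sizes(n_samples, n_folds)):
    # The training part of a slot is everything that is not in its probe set.
    print("%d: %d | %d" % (slot, n_samples - probe, probe))
```

Every row lands in exactly one probe set, so the probe sizes add up to the full training set size.
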
Then create a LinearModel_1.dsc file with the following configuration:
ALGORITHM=LinearModel
ID=1
#TRAIN_ON_FULLPREDICTOR=
DISABLE=0

[int]
maxTuninigEpochs=10

[double]
initMaxSwing=1.0
initReg=0.01

[bool]
tuneRigeModifiers=0
enableClipping=0
enableTuneSwing=0

minimzeProbe=0
minimzeProbeClassificationError=0
minimzeBlend=1
minimzeBlendClassificationError=0

[string]
weightFile=LinearModel_1_weights.dat
fullPrediction=LinearModel_1.dat

Now create a subfolder called CSV/DataFiles, and inside it a file called settings.txt with the following:
delimiter=,
train=train.csv
trainTargetColumn=19
test=test.csv
Here train.csv and test.csv are your train and test filenames, and trainTargetColumn is the index of the last column of your data (column numbers start from zero).

Note: train and test should have the same number of columns. If the test data has no labels, add a column of zeros.
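
The two steps above (picking the zero-based target column and padding an unlabeled test file) are easy to get wrong by hand, so here is a small helper sketch. The function names are hypothetical and not part of ELF; the sketch assumes plain comma-separated files without quoted fields, and it writes only the four settings.txt keys shown above.

```python
# Sketch only: helpers for preparing CSV/DataFiles. Names are hypothetical.
import csv
import os

def write_settings(data_dir, train_name, test_name, delimiter=","):
    """Write settings.txt, using the last column of the train file as target."""
    with open(os.path.join(data_dir, train_name), newline="") as f:
        first_row = next(csv.reader(f, delimiter=delimiter))
    target_col = len(first_row) - 1  # column numbers start from zero
    lines = [
        "delimiter=" + delimiter,
        "train=" + train_name,
        "trainTargetColumn=%d" % target_col,
        "test=" + test_name,
    ]
    with open(os.path.join(data_dir, "settings.txt"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return target_col

def pad_test_with_zero_labels(src_path, dst_path, delimiter=","):
    """Copy an unlabeled test CSV, appending a zero label to every row."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        for row in csv.reader(src, delimiter=delimiter):
            if row:  # skip empty lines
                writer.writerow(row + ["0"])
```

For example, calling pad_test_with_zero_labels on a raw test file to produce CSV/DataFiles/test.csv, and then write_settings("CSV/DataFiles", "train.csv", "test.csv"), would give a train file with 20 columns the trainTargetColumn=19 setting shown above.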

Running ELF

To train, run:

ubuntu@domU-12-31-35-00-21-42:~$ ./ELF CSV/ t
maxThreads(OPENMP): 4
Scheduler
Constructor Data
Open master .dsc file:CSV//Master.dsc
isClassificationDataset: 1
Set max. threads in MKL and IPP: 2
maxThreads(OPENMP): 2
Train 6-fold cross validation
ValidationType: Retraining
Set random seed to: 124391994
randomSeed: 124391994
frameworkMode: 0
Start scheduled training
Fill data
gradientBoostingLoops:1
DatasetReader
Read CSV from: CSV//DataFiles
#feat:5
Target values: [0]-1 [1]1 
descructor DatasetReader
reduce training set (current size:6162863) to 100% of its original size  [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features)  [nothing to do]
subsample the columns (current:5) to 100% of columns (skip constant 1 features)  [nothing to do]
Randomize the train dataset: 123257260 line swaps [..........] mixInd[0]:467808  mixInd[6162862]:3154542
Enable bagging:0
Set algorithm list (nTrained:0)
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
Alloc mem for cross validation data sets (doOnlyNormalization:0)
Cross-validation settings: 6 sets
Calculating mean and std per input
f:3lim f:4lim 

StdMin:0.01
Normalization:[Min|Max mean: -2.72612|-0.940528  Min|Max std: 0.01|0.687338]  Features: RawInputs[Min|Max value: -5.7863|0.64705]  AfterNormalization[Min|Max value:-4.45221|10.8926] on 5 features
Targets: min|max|mean [Nr0:-1|1|0.803235] [Nr1:-1|1|-0.803235] 
Save mean and std: CSV//TempFiles/normalization.dat.algo1.add0
Random seed:124391994
nFeatures:5
nClass:2
nDomain:1
nTrain:6162863 nValid:0 nTest:0
Make 616286300 index swaps (randomize sample index list)
partition size: 1.02714e+06
slot: TRAIN | PROBE
===================
0: 5135719 | 1027144
1: 5135719 | 1027144
2: 5135719 | 1027144
3: 5135720 | 1027143
4: 5135719 | 1027144
5: 5135719 | 1027144
6: 6162863 | 0
probe sum:6162863
Train algorithm:CSV//LinearModel_1.dsc
Load descriptor file: CSV//LinearModel_1.dsc
[META] ALGORITHM: LinearModel
[META] ID: 1
[META] DISABLE: 0
maxTuninigEpochs: 10
initMaxSwing: 1.0
initReg: 0.01
tuneRigeModifiers: 0
enableClipping: 0
enableTuneSwing: 0
minimzeProbe: 0
minimzeProbeClassificationError: 0
minimzeBlend: 1
minimzeBlendClassificationError: 0
weightFile: LinearModel_1_weights.dat
fullPrediction: LinearModel_1.dat
AlgoTemplate:CSV//LinearModel_1.dsc  Algo:CSV//DscFiles/LinearModel_1.dsc
Output File for cout redirect is set now to CSV//DscFiles/LinearModel_1.dsc
Floating point precision: 4 Bytes
Partition dataset to cross validation sets
Can not open effect file:CSV//FullPredictorFiles/
Init residuals
Write first 1000 lines of the trainset(Atrain.txt) and targets(AtrainTarget.txt)
Apply mean and std correction to train input features
Min/Max feature values after apply mean/std: -4.45221/10.8926
Min/Max target: -1/1
Mean target: 0.803235 -0.803235 

Constructor Data
Algorithm
StandardAlgorithm
LinearModel
Set data pointers
Start train StandardAlgorithm
Init standard algorithm
Read dsc maps (standard values)
Constructor BlendStopping
Number of predictors for blendStopping: 2 (+1 const, +1 new)


Blending regularization: 0.0001
 [CalcBlend] lambda:0.0001  [classErr:9.83825%] 
ERR Blend:0.59568

============================ START TRAIN (param tuning) =============================

Parameters to tune:
[REAL] name:reg   initValue:0.01
(min|max. epochs: 0|10)


==================== auto-optimize ====================

(epoch=0) reg=0.01 ...... [classErr:38.0955%]  [probe:0.992891]  [CalcBlend] lambda:0.0001  [classErr:9.83952%] ERR=0.583664 11[s][saveBest][SB]
(epoch=1) reg=0.008 ...... [classErr:38.1632%]  [probe:0.992889]  [CalcBlend] lambda:0.0001  [classErr:9.83963%] ERR=0.583661 11[s] !min! [saveBest][SB]
(epoch=2) reg=0.0064 ...... [classErr:38.2209%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83973%] ERR=0.58366 11[s] !min! [saveBest][SB] accelerate 
(epoch=3) reg=0.0048422 ...... [classErr:38.2776%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83976%] ERR=0.583661 11[s]
(epoch=4) reg=0.008 ...... [classErr:38.1632%]  [probe:0.992889]  [CalcBlend] lambda:0.0001  [classErr:9.83963%] ERR=0.583661 11[s]
(epoch=5) reg=0.00535367 ...... [classErr:38.2585%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83979%] ERR=0.583661 12[s]
(epoch=6) reg=0.00738248 ...... [classErr:38.1849%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83968%] ERR=0.583661 11[s]
(epoch=7) reg=0.00570903 ...... [classErr:38.2454%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=8) reg=0.00701252 ...... [classErr:38.1978%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83968%] ERR=0.58366 11[s]
(epoch=9) reg=0.00594873 ...... [classErr:38.2369%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83983%] ERR=0.58366 11[s]
(epoch=10) reg=0.00678554 max. epochs reached.
expSearchErrorBest:0.58366  error:0.58366

============================ END auto-optimize =============================


Calculate FullPrediction (write the prediction of the trainingset with cross validation)

Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
Save blending weights: CSV//TempFiles/blendingWeights_02.dat

Write full prediction: CSV//FullPredictorFiles/LinearModel_1.dat (RMSE:0.992888)
Validation type: Retraining
Update model on whole training set

Save:CSV//TempFiles/LinearModel_1_weights.dat.006
Calculate retrain RMSE (on trainset)
Train of this algorithm (RMSE after retraining): 0.992894
Total retrain time:3[s]

===========================================================================
Constructor BlendStopping
ADD:CSV//FullPredictorFiles/LinearModel_1.dat Number of predictors for blendStopping: 2 (+1 const)

File:CSV//FullPredictorFiles/LinearModel_1.dat  RMSE:0.992888

Blending regularization: 0.0001
 [CalcBlend] lambda:0.0001 Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
[Write train prediction:CSV//TempFiles/trainPrediction.data] nSamples:6162863
 [classErr:9.83973%] Blending weights (row: classes, col: predictors[1.col=const predictor])
0.799 1.011 
-0.799 1.011 
Save blending weights: CSV//TempFiles/blendingWeights_02.dat

BLEND RMSE OF ACTUAL FULLPREDICTION PATH:0.58366
===========================================================================

destructor BlendStopping
delete algo
descructor LinearModel
descructor StandardAlgorithm
destructor BlendStopping
descructor Algorithm
destructor Data
Finished train algorithm:CSV//LinearModel_1.dsc
Finished in 275[s]
Clear output file for cout
Delete internal memory
Total training time:399[s]
descructor Scheduler
destructor Data
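
The log is long, and the interesting part is usually the auto-optimize section. Below is a small sketch that pulls the epoch, regularization value, probe RMSE and blend error out of those lines and reports the best epoch. The regular expression is my own and only handles single-parameter lines of the form shown above (one reg= value per epoch); other algorithms may print different lines.

```python
# Sketch only: parse the "(epoch=N) reg=... ERR=..." lines of an ELF log.
# The pattern is an assumption based on the log output shown in this post.
import re

EPOCH_RE = re.compile(
    r"\(epoch=(\d+)\)\s+reg=([0-9.eE+-]+).*?"
    r"\[probe:([0-9.eE+-]+)\].*?ERR=([0-9.eE+-]+)"
)

def parse_epochs(log_text):
    """Return one dict per tuning epoch that reports an ERR value."""
    epochs = []
    for line in log_text.splitlines():
        m = EPOCH_RE.search(line)
        if m:
            epochs.append({
                "epoch": int(m.group(1)),
                "reg": float(m.group(2)),
                "probe": float(m.group(3)),
                "err": float(m.group(4)),
            })
    return epochs

# Three lines copied from the log above, plus the final line without ERR.
sample = """\
(epoch=0) reg=0.01 ...... [classErr:38.0955%]  [probe:0.992891]  [CalcBlend] lambda:0.0001  [classErr:9.83952%] ERR=0.583664 11[s][saveBest][SB]
(epoch=1) reg=0.008 ...... [classErr:38.1632%]  [probe:0.992889]  [CalcBlend] lambda:0.0001  [classErr:9.83963%] ERR=0.583661 11[s] !min! [saveBest][SB]
(epoch=2) reg=0.0064 ...... [classErr:38.2209%]  [probe:0.992888]  [CalcBlend] lambda:0.0001  [classErr:9.83973%] ERR=0.58366 11[s] !min! [saveBest][SB] accelerate
(epoch=10) reg=0.00678554 max. epochs reached.
"""

epochs = parse_epochs(sample)
best = min(epochs, key=lambda e: e["err"])
print("best epoch:", best["epoch"], "reg:", best["reg"], "ERR:", best["err"])
```

To use it on a real run, redirect the output of ./ELF CSV/ t to a file and pass the file contents to parse_epochs.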
