The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label space. PfastreXML is an efficient tree ensemble based extreme classifier that can scale to millions of labels. PfastreXML can be trained on most datasets using a desktop/small cluster and can make predictions in milliseconds per test point. PfastreXML improves upon FastXML by optimizing propensity scored nDCG at each tree node and by re-ranking the predicted labels.
This code is made available as is for non-commercial research purposes. Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use PfastreXML unless you agree to the terms of the license.
Download the PfastreXML source code in C++ and Matlab, as well as precompiled Windows/Linux binaries. The code for PfastreXML is written in C++ and should compile on 64-bit Windows/Linux machines using a C++11 enabled compiler. Matlab wrappers are also provided with the code. Installation and usage instructions are given below and in Readme.txt. The default parameters provided in the Usage section work reasonably well on the benchmark datasets in the Extreme Classification Repository.
Please contact Yashoteja Prabhu and Manik Varma if you have any questions or feedback.
Please visit the Extreme Classification Repository to download the benchmark datasets and compare PfastreXML's performance to baseline algorithms.
To train PfastreXML:

C++:

    ./PfastreXML_train [feature file name] [label file name] [inverse propensity file name] [model folder name] -S 0 -T 1 -s 0 -t 50 -b 1.0 -c 1.0 -m 10 -l 10 -g 30 -a 0.8

Matlab:

    PfastreXML_train([feature matrix], [label matrix], [inverse propensity vector], param, [model folder name])

where:

    -S ≡ param.pfswitch       : PfastXML switch; setting this to 1 omits the tail classifiers, reducing to the PfastXML algorithm (default=0)
    -T ≡ param.num_thread     : number of threads to use (default=1)
    -s ≡ param.start_tree     : starting tree index (default=0)
    -t ≡ param.num_tree       : number of trees to be grown (default=50)
    -b ≡ param.bias           : feature bias value, an extra feature value to be appended (default=1.0)
    -c ≡ param.log_loss_coeff : SVM weight co-efficient (default=1.0)
    -m ≡ param.max_leaf       : maximum number of instances allowed in a leaf node; larger nodes are attempted to be split, and converted to leaves if the split fails (default=10)
    -l ≡ param.lbl_per_leaf   : number of label-probability pairs to retain in a leaf (default=100)
    -g ≡ param.gamma          : gamma parameter appearing in the tail label classifiers (default=30)
    -a ≡ param.alpha          : trade-off parameter between the PfastXML and tail classifier scores (default=0.8 for propensity scored metrics, 0.9 otherwise)
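For illustration, a minimal training run from Matlab might look like the sketch below. The file names trn_X_Xf.txt and trn_X_Y.txt and the model folder name are hypothetical placeholders, and the parameter values simply mirror the defaults listed above.

    % Load sparse training feature and label matrices from text format
    % (hypothetical file names)
    trn_X_Xf = read_text_mat('trn_X_Xf.txt');
    trn_X_Y  = read_text_mat('trn_X_Y.txt');

    % Inverse label propensities (A=0.55, B=1.5 for datasets other than Wikipedia-LSHTC/Amazon)
    inv_prop = inv_propensity(trn_X_Y, 0.55, 1.5);

    % Training parameters mirroring the defaults above
    param.pfswitch       = 0;
    param.num_thread     = 1;
    param.start_tree     = 0;
    param.num_tree       = 50;
    param.bias           = 1.0;
    param.log_loss_coeff = 1.0;
    param.max_leaf       = 10;
    param.lbl_per_leaf   = 100;
    param.gamma          = 30;
    param.alpha          = 0.8;

    PfastreXML_train(trn_X_Xf, trn_X_Y, inv_prop, param, 'model');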
To make predictions with a trained PfastreXML model:

C++:

    ./PfastreXML_test [feature file name] [score file name] [model folder name] -S 0 -T 1 -s 0 -t 50 -n 1000

Matlab:

    [score matrix] = PfastreXML_test([feature matrix], param, [model folder name])

where:

    -S ≡ param.pfswitch   : same as in training (default=value saved in the trained model)
    -T ≡ param.num_thread : same as in training (default=value saved in the trained model)
    -s ≡ param.start_tree : same as in training (default=value saved in the trained model)
    -t ≡ param.num_tree   : same as in training (default=value saved in the trained model)
    -n ≡ param.actlbl     : number of predicted scores to return per test instance; a lower value gives faster prediction (default=1000)
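As a sketch, the corresponding prediction call from Matlab might be the following; the file and folder names are hypothetical, and the parameter fields are set explicitly here rather than relying on the values saved in the model.

    % Load the sparse test feature matrix (hypothetical file name)
    tst_X_Xf = read_text_mat('tst_X_Xf.txt');

    clear param;
    param.pfswitch   = 0;
    param.num_thread = 1;
    param.start_tree = 0;
    param.num_tree   = 50;
    param.actlbl     = 1000;   % number of predicted label scores per test point

    score_mat = PfastreXML_test(tst_X_Xf, param, 'model');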
To evaluate prediction performance in Matlab:

    [metrics] = get_all_metrics([test score matrix], [test label matrix], [inverse label propensity vector])
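For example, evaluating predictions stored in the text format could look like the following sketch; all file names are hypothetical, and inv_prop is computed from the training labels exactly as during training.

    % Predicted scores (e.g. written by PfastreXML_test) and ground-truth test labels
    score_mat = read_text_mat('score_mat.txt');
    tst_X_Y   = read_text_mat('tst_X_Y.txt');

    % Inverse propensities computed from the training label matrix
    inv_prop  = inv_propensity(read_text_mat('trn_X_Y.txt'), 0.55, 1.5);

    metrics   = get_all_metrics(score_mat, tst_X_Y, inv_prop);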
To convert a data file downloaded from the Extreme Classification Repository into the separate feature and label files required by this code:

    perl convert_format.pl [repository data file] [output feature file name] [output label file name]
To read a text-format matrix into Matlab:

    [matrix] = read_text_mat([text matrix name]);

To write a Matlab sparse matrix into text format:

    write_text_mat([Matlab sparse matrix], [text matrix name to be written to]);
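For instance, a quick round trip through the text format (hypothetical file names):

    X = read_text_mat('trn_X_Xf.txt');        % load a sparse matrix from text format
    write_text_mat(X, 'trn_X_Xf_copy.txt');   % write it back out in the same format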
To generate the inverse label propensity weights:

    [weights vector] = inv_propensity([training label matrix],A,B);

A and B are the parameters of the inverse propensity model. The following values should be used for the benchmark datasets:

    Wikipedia-LSHTC: A=0.5,  B=0.4
    Amazon:          A=0.6,  B=2.6
    Other:           A=0.55, B=1.5
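For example, computing the weights for a Wikipedia-LSHTC style training label matrix (file name hypothetical):

    trn_X_Y  = read_text_mat('trn_X_Y.txt');        % sparse training label matrix
    inv_prop = inv_propensity(trn_X_Y, 0.5, 0.4);   % A=0.5, B=0.4 for Wikipedia-LSHTC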