The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label space. FastXML is an efficient tree-ensemble based extreme classifier that scales to millions of labels. FastXML can be trained on most datasets using a desktop or a small cluster, and it makes predictions in milliseconds per test point. Note that tree ensembles generally require a lot of RAM, and FastXML is no exception.
Note: You might also be interested in PfastreXML.
This code is made available as is for non-commercial research purposes. Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use FastXML unless you agree to the terms of the license.
Download the FastXML source code in C++ and Matlab, as well as precompiled Windows/Linux binaries.

The code for FastXML is written in C++ and should compile on 64-bit Windows/Linux machines using a C++11 enabled compiler. Matlab wrappers are also provided with the code. Installation and usage instructions are given below and in Readme.txt. The default parameters provided in the Usage section work reasonably well on the benchmark datasets in the Extreme Classification Repository.
Please contact Yashoteja Prabhu and Manik Varma if you have any questions or feedback.
Please visit the Extreme Classification Repository to download the benchmark datasets and compare FastXML's performance to baseline algorithms.
To train FastXML:

C++:
./fastXML_train [feature file name] [label file name] [model folder name] -S 0 -T 1 -s 0 -t 50 -b 1.0 -c 1.0 -m 10 -l 10

Matlab:
fastXML_train([feature matrix], [label matrix], param, [model folder name])

where:
-T ≡ param.num_thread : Number of threads to use (default=1)
-s ≡ param.start_tree : Starting tree index (default=0)
-t ≡ param.num_tree : Number of trees to be grown (default=50)
-b ≡ param.bias : Feature bias value; an extra feature with this value is appended to each datapoint (default=1.0)
-c ≡ param.log_loss_coeff : SVM weight coefficient (default=1.0)
-l ≡ param.lbl_per_leaf : Number of label-probability pairs to retain in a leaf (default=100)
-m ≡ param.max_leaf : Maximum number of instances allowed in a leaf node; larger nodes are attempted to be split and, if the split fails, are converted to leaves (default=10)
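For concreteness, a complete Matlab training call might look like the sketch below. The file and folder names are hypothetical placeholders, and the param fields simply mirror the flags listed above.

% Hedged sketch: 'trn_ft_mat.txt', 'trn_lbl_mat.txt' and 'model' are placeholder names.
trn_X = read_text_mat('trn_ft_mat.txt');   % training features in sparse text format
trn_Y = read_text_mat('trn_lbl_mat.txt');  % training labels in sparse text format
param.num_thread     = 1;                  % -T
param.start_tree     = 0;                  % -s
param.num_tree       = 50;                 % -t
param.bias           = 1.0;                % -b
param.log_loss_coeff = 1.0;                % -c
param.max_leaf       = 10;                 % -m
param.lbl_per_leaf   = 100;                % -l
fastXML_train(trn_X, trn_Y, param, 'model');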
To make predictions with a trained model:

C++:
./fastXML_test [feature file name] [score file name] [model folder name] -T 1 -s 0 -t 50

Matlab:
[score matrix] = fastXML_test([feature matrix], param, [model folder name])

where:
-T ≡ param.num_thread : Same as in training (default=value saved in the trained model)
-s ≡ param.start_tree : Same as in training (default=value saved in the trained model)
-t ≡ param.num_tree : Same as in training (default=value saved in the trained model)
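Continuing the hypothetical example above, prediction in Matlab might look like the following; the test feature file name is again a placeholder.

tst_X = read_text_mat('tst_ft_mat.txt');  % placeholder test feature file
param.num_thread = 1;                     % -T; omitted fields fall back to values saved in the model
param.start_tree = 0;                     % -s
param.num_tree   = 50;                    % -t
score_mat = fastXML_test(tst_X, param, 'model');  % label scores for each test point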
The performance of FastXML can be evaluated in Matlab using:

[metrics] = get_all_metrics([test score matrix], [test label matrix], [inverse label propensity vector])
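Putting the pieces together, a hedged evaluation sketch, reusing the placeholder names from the training and prediction sketches above and the A, B values suggested for generic datasets at the end of this section:

tst_Y = read_text_mat('tst_lbl_mat.txt');          % placeholder ground-truth label file
wts = inv_propensity(trn_Y, 0.55, 1.5);            % inverse label propensities (see below)
metrics = get_all_metrics(score_mat, tst_Y, wts);  % precision/nDCG style metrics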
FastXML expects its input in a different format from the repository data files. To convert a repository data file into separate feature and label files:

perl convert_format.pl [repository data file] [output feature file name] [output label file name]
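As an illustration of what the conversion does (a sketch, assuming the usual Extreme Classification Repository conventions): a repository file has a "num_points num_features num_labels" header and one datapoint per line, with comma-separated label indices followed by sparse feature:value pairs; the script splits this into a feature file and a label file, each in sparse text-matrix format with its own "rows columns" header. The toy values below are made up.

[repository data file]
2 5 3
0,2 1:0.5 4:1.2
1 0:0.3 2:0.7

[output feature file]
2 5
1:0.5 4:1.2
0:0.3 2:0.7

[output label file]
2 3
0:1 2:1
1:1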
To read a text-format matrix into Matlab:

[matrix] = read_text_mat([text matrix name]);

To write a Matlab matrix into text format:
write_text_mat([Matlab sparse matrix], [text matrix name to be written to]);
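A quick hypothetical round trip through these two helpers; the file name is a placeholder:

M  = sprand(100, 50, 0.05);    % random sparse Matlab matrix
write_text_mat(M, 'mat.txt');  % write it out in text format
M2 = read_text_mat('mat.txt'); % M2 should match M up to text-format precision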
[weights vector] = inv_propensity([training label matrix],A,B);

A and B are the parameters of the inverse propensity model. The following values should be used for the benchmark datasets:
Wikipedia-LSHTC: A=0.5, B=0.4
Amazon: A=0.6, B=2.6
Other: A=0.55, B=1.5
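For reference, inv_propensity presumably implements the propensity model of Jain et al. (KDD 2016), in which a label with N_l positive training points out of N gets inverse propensity 1/p_l = 1 + C(N_l + B)^(-A), where C = (log N - 1)(1 + B)^A. A minimal Matlab sketch of that formula, assuming the label matrix stores one training point per column (the orientation is an assumption, not taken from the shipped code):

function wts = my_inv_propensity(Y, A, B)  % hypothetical re-implementation, not the shipped inv_propensity
    N   = size(Y, 2);                      % number of training points (assumed to be columns)
    Nl  = full(sum(Y ~= 0, 2));            % number of positive training points per label
    C   = (log(N) - 1) * (1 + B)^A;        % normalisation constant of the propensity model
    wts = 1 + C * (Nl + B).^(-A);          % inverse propensities 1/p_l, one per label
end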