The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label space. FastXML is an efficient tree-ensemble based extreme classifier that scales to millions of labels. FastXML can be trained on most datasets using a desktop or a small cluster, and it makes predictions in milliseconds per test point. Note that tree ensembles generally require a lot of RAM, and FastXML is no exception.
Note: You might also be interested in PfastreXML.
This code is made available as is for non-commercial research purposes. Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use FastXML unless you agree to the terms of the license.
Download the FastXML source code in C++ and Matlab, as well as precompiled Windows/Linux binaries.

The code for FastXML is written in C++ and should compile on 64-bit Windows/Linux machines using a C++11 enabled compiler. Matlab wrappers are also provided with the code. Installation and usage instructions are given below and in Readme.txt. The default parameters provided in the Usage section work reasonably well on the benchmark datasets in the Extreme Classification Repository.
Please contact Yashoteja Prabhu and Manik Varma if you have any questions or feedback.
Please visit the Extreme Classification Repository to download the benchmark datasets and compare FastXML's performance to baseline algorithms.
To train FastXML:

C++:
./fastXML_train [feature file name] [label file name] [model folder name] -S 0 -T 1 -s 0 -t 50 -b 1.0 -c 1.0 -m 10 -l 10

Matlab:
fastXML_train([feature matrix], [label matrix], param, [model folder name])

where:
-T ≡ param.num_thread : Number of threads to use (default=1)
-s ≡ param.start_tree : Starting tree index (default=0)
-t ≡ param.num_tree : Number of trees to be grown (default=50)
-b ≡ param.bias : Feature bias value; an extra feature with this value is appended to each datapoint (default=1.0)
-c ≡ param.log_loss_coeff : SVM weight coefficient (default=1.0)
-l ≡ param.lbl_per_leaf : Number of label-probability pairs to retain in a leaf (default=100)
-m ≡ param.max_leaf : Maximum number of instances allowed in a leaf node; larger nodes are attempted to be split and, if the split fails, are converted to leaves (default=10)
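For concreteness, a complete Matlab training call might look like the sketch below. The file and folder names are hypothetical placeholders, and the param fields simply mirror the flags listed above.

% Hedged sketch: 'trn_ft_mat.txt', 'trn_lbl_mat.txt' and 'model' are placeholder names.
trn_X = read_text_mat('trn_ft_mat.txt');   % training features in sparse text format
trn_Y = read_text_mat('trn_lbl_mat.txt');  % training labels in sparse text format
param.num_thread     = 1;                  % -T
param.start_tree     = 0;                  % -s
param.num_tree       = 50;                 % -t
param.bias           = 1.0;                % -b
param.log_loss_coeff = 1.0;                % -c
param.max_leaf       = 10;                 % -m
param.lbl_per_leaf   = 100;                % -l
fastXML_train(trn_X, trn_Y, param, 'model');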
To make predictions with a trained model:

C++:
./fastXML_test [feature file name] [score file name] [model folder name] -T 1 -s 0 -t 50

Matlab:
[score matrix] = fastXML_test([feature matrix], param, [model folder name])

where:
-T ≡ param.num_thread : Same as in training (default=value saved in the trained model)
-s ≡ param.start_tree : Same as in training (default=value saved in the trained model)
-t ≡ param.num_tree : Same as in training (default=value saved in the trained model)
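Continuing the hypothetical example above, prediction in Matlab might look like the following; the test feature file name is again a placeholder.

tst_X = read_text_mat('tst_ft_mat.txt');  % placeholder test feature file
param.num_thread = 1;                     % -T; omitted fields fall back to values saved in the model
param.start_tree = 0;                     % -s
param.num_tree   = 50;                    % -t
score_mat = fastXML_test(tst_X, param, 'model');  % label scores for each test point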
The performance of FastXML can be evaluated in Matlab using:

[metrics] = get_all_metrics([test score matrix], [test label matrix], [inverse label propensity vector])
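Putting the pieces together, a hedged evaluation sketch, reusing the placeholder names from the training and prediction sketches above and the A, B values suggested for generic datasets at the end of this section:

tst_Y = read_text_mat('tst_lbl_mat.txt');          % placeholder ground-truth label file
wts = inv_propensity(trn_Y, 0.55, 1.5);            % inverse label propensities (see below)
metrics = get_all_metrics(score_mat, tst_Y, wts);  % precision/nDCG style metrics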
FastXML expects its input in a different format from the repository data files. To convert a repository data file into separate feature and label files:

perl convert_format.pl [repository data file] [output feature file name] [output label file name]
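As an illustration of what the conversion does (a sketch, assuming the usual Extreme Classification Repository conventions): a repository file has a "num_points num_features num_labels" header and one datapoint per line, with comma-separated label indices followed by sparse feature:value pairs; the script splits this into a feature file and a label file, each in sparse text-matrix format with its own "rows columns" header. The toy values below are made up.

[repository data file]
2 5 3
0,2 1:0.5 4:1.2
1 0:0.3 2:0.7

[output feature file]
2 5
1:0.5 4:1.2
0:0.3 2:0.7

[output label file]
2 3
0:1 2:1
1:1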
To read a text-format matrix into Matlab:

[matrix] = read_text_mat([text matrix name]);

To write a Matlab matrix into text format:
write_text_mat([Matlab sparse matrix], [text matrix name to be written to]);
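A quick hypothetical round trip through these two helpers; the file name is a placeholder:

M  = sprand(100, 50, 0.05);    % random sparse Matlab matrix
write_text_mat(M, 'mat.txt');  % write it out in text format
M2 = read_text_mat('mat.txt'); % M2 should match M up to text-format precision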
[weights vector] = inv_propensity([training label matrix],A,B);

A and B are the parameters of the inverse propensity model. The following values should be used for the benchmark datasets:
Wikipedia-LSHTC: A=0.5, B=0.4
Amazon: A=0.6, B=2.6
Other: A=0.55, B=1.5
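For reference, inv_propensity presumably implements the propensity model of Jain et al. (KDD 2016), in which a label with N_l positive training points out of N gets inverse propensity 1/p_l = 1 + C(N_l + B)^(-A), where C = (log N - 1)(1 + B)^A. A minimal Matlab sketch of that formula, assuming the label matrix stores one training point per column (the orientation is an assumption, not taken from the shipped code):

function wts = my_inv_propensity(Y, A, B)  % hypothetical re-implementation, not the shipped inv_propensity
    N   = size(Y, 2);                      % number of training points (assumed to be columns)
    Nl  = full(sum(Y ~= 0, 2));            % number of positive training points per label
    C   = (log(N) - 1) * (1 + B)^A;        % normalisation constant of the propensity model
    wts = 1 + C * (Nl + B).^(-A);          % inverse propensities 1/p_l, one per label
end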