The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label space. Traditional embedding based approaches make training and prediction tractable by projecting the high dimensional label vectors onto low dimensional linear subspaces. Embedding approaches have many advantages including simplicity, ease of implementation, strong theoretical foundations, the ability to handle label correlations and the ability to adapt to online and incremental scenarios. Unfortunately, the low rank assumption on the label matrix is violated in most real world applications. For example, in the WikiLSHTC dataset, only 10% of the label matrix can be reconstructed using a 500 dimensional subspace. SLEEC addresses these problems by learning embeddings which preserve pairwise distances between only nearest label vectors. SLEEC can be used for large-scale problems with efficient training and prediction times, small model size and good accuracy.
Download SLEEC source code in Matlab
Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use SLEEC unless you agree to the terms of the license.
The code for SLEEC is written in Mex (Matlab) and should compile on 32/64 bit Windows/Linux machines. Please note that the current version has multi-threading support only on Windows. Installation and usage instructions will be found in Readme.txt. This code is made available as is for non-commercial research purposes.
The zip file containing the code also contains the parameters for the benchmark datasets.
Please contact Kush Bhatia and Manik Varma if you have any questions or feedback.
Please visit the Extreme Classification Repository to download the benchmark datasets and compare SLEEC's performance to baseline algorithms.
The SLEEC code works with Matlab on Linux and Windows. Please setup mex before executing the following commands.To compile SLEEC, run the following command on the Matlab prompt
SLEEC training and prediction code have been combined with the SLEEC functions. To run SLEEC, try
> load data.mat
> [result] = SLEEC(data, params);
The following explains the matlab structure for the data, parameters and the output result
data : structure containing the complete dataset data.X : sparse n x d matrix containing train features data.Xt : sparse m x d matrix containing test features data.Y : sparse n x L matrix containing train labels data.Yt : sparse m x L matrix containing test labels
params.num_learners : number of SLEEC learners in ensemble (default 10) params.num_clusters : Initial number of clusters (default 300) params.num_threads : Number of threads for parallelization (default 32) params.SVP_neigh : Number of nearest neighbours to be preserved (default 15) params.out_Dim : embedding dimensions (default 50) params.w_thres : 1-w_thresh is the sparsity of regressors w (default 0.7) params.sp_thresh : 1-sp_thresh is the sparsity of embeddings (default 0.7) params.cost : liblinear cost coefficient (default 0.1) params.NNtest : number of nearest neighbours to consider while testing (default 10) params.normalize : 1 for normalized data, 2 for unnormalized data (only for mediamill) params.fname : filename for logging purposes
result.clusterCenters : cluster centers for the different learners result.tim_clus : time taken for clustering result.SVPModel : model for different learners containing embeddings and regressors result.SVPtime_mat : time taken for performing SVP for each learner result.regressiontime_mat : time taken for learning regressors result.precision : overall precision accuracy result.predictAcc : precision accuracy per test point result.predictLabels : Top-k labels predicted per test point result.tim_test : time taken for the testing procedure result.test_KNN : kNN Matrix for the test points
The zip file containing the source code also includes the BibTeX dataset as a toy example. The following are the instructions to run SLEEC on the BibTeX dataset
> cd Toy_Example
> load bibtex.mat
> cd ..
> [result] = SLEEC(data, params);
To verify that you have run the code successfully, please compare the precision that you obtain with the result structure that has been provided in the Toy_Example folder.