SLEEC: Sparse Local Embeddings for Extreme multi-label Classification


Kush Bhatia, Himanshu Jain, Prateek Jain, Purushottam Kar, Manik Varma

The objective in extreme multi-label learning is to learn a classifier that can automatically tag a data point with the most relevant subset of labels from an extremely large label space. Traditional embedding based approaches make training and prediction tractable by projecting the high dimensional label vectors onto low dimensional linear subspaces. Embedding based approaches have many advantages, including simplicity, ease of implementation, strong theoretical foundations, the ability to handle label correlations and the ability to adapt to online and incremental scenarios. Unfortunately, the low rank assumption on the label matrix is violated in most real world applications. For example, in the WikiLSHTC dataset, only 10% of the label matrix can be reconstructed using a 500 dimensional subspace. SLEEC addresses these problems by learning embeddings which preserve pairwise distances between only the nearest label vectors. SLEEC can be used for large-scale problems with efficient training and prediction times, a small model size and good accuracy.
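As a rough illustration of the low rank claim, one can measure how much of a sparse label matrix is captured by a low dimensional subspace using a truncated SVD. The sketch below is not part of the SLEEC package; it assumes Y is a sparse n x L train label matrix (such as data.Y described under Usage), and computing 500 singular values of a very large matrix can be slow.

	% Illustrative sketch only: estimate how much of a sparse n x L label
	% matrix Y is captured by its best rank-k approximation.
	k = 500;                                            % subspace dimension to test
	[~, S, ~] = svds(Y, k);                             % top-k singular values of Y
	captured  = norm(diag(S), 2)^2 / norm(Y, 'fro')^2;  % ||Y_k||_F^2 / ||Y||_F^2 (Eckart-Young)
	fprintf('Fraction of the label matrix captured by a rank-%d subspace: %.2f\n', k, captured);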

Download SLEEC

Download SLEEC source code in Matlab

Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use SLEEC unless you agree to the terms of the license.

The code for SLEEC is written in Mex (Matlab) and should compile on 32/64 bit Windows/Linux machines. Please note that the current version has multi-threading support only on Windows. Installation and usage instructions can be found in Readme.txt. This code is made available as is for non-commercial research purposes. The zip file containing the code also contains the parameters for the benchmark datasets.

Please contact Kush Bhatia and Manik Varma if you have any questions or feedback.

Experimental Results and Datasets

Please visit the Extreme Classification Repository to download the benchmark datasets and compare SLEEC's performance to baseline algorithms.

Usage

The SLEEC code works with Matlab on Linux and Windows. Please set up mex before executing the following commands.

To compile SLEEC, run the following command at the Matlab prompt
> make_SLEEC

The SLEEC training and prediction code has been combined into a single SLEEC function. To run SLEEC, try
> load data.mat
> SLEECparams
> [result] = SLEEC(data, params);


The following describes the Matlab structures for the data, the parameters and the output result.

SLEEC Data

	data    : structure containing the complete dataset
	data.X  : sparse n x d matrix containing train features
	data.Xt : sparse m x d matrix containing test features
	data.Y  : sparse n x L matrix containing train labels
	data.Yt : sparse m x L matrix containing test labels
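
As an illustration (not part of the package), the data structure can be assembled from sparse Matlab matrices; Xtrain, Xtest, Ytrain and Ytest below are placeholder variable names.

	% Hypothetical sketch: assemble the data structure from existing matrices.
	% Xtrain (n x d), Xtest (m x d), Ytrain (n x L) and Ytest (m x L) are
	% assumed to be loaded already; sparse() is a no-op if they are already sparse.
	data.X  = sparse(Xtrain);
	data.Xt = sparse(Xtest);
	data.Y  = sparse(Ytrain);
	data.Yt = sparse(Ytest);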

SLEEC Parameters

	params.num_learners : number of SLEEC learners in ensemble (default 10)
	params.num_clusters : Initial number of clusters (default 300)
	params.num_threads  : Number of threads for parallelization (default 32)
	params.SVP_neigh    : Number of nearest neighbours to be preserved (default 15)
	params.out_Dim      : embedding dimensions (default 50)
	params.w_thresh     : 1 - w_thresh is the sparsity of the regressors W (default 0.7)
	params.sp_thresh    : 1-sp_thresh is the sparsity of embeddings (default 0.7)
	params.cost         : liblinear cost coefficient (default 0.1)
	params.NNtest       : number of nearest neighbours to consider while testing (default 10)
	params.normalize    : 1 for normalized data, 2 for unnormalized data (only for mediamill)
	params.fname        : filename for logging purposes
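
SLEECparams sets the defaults listed above; individual fields can be overridden before calling SLEEC. The values below are only an example, not recommended settings.

> SLEECparams                       % load the default parameters
> params.num_learners = 5;          % smaller ensemble for a quick run
> params.num_threads  = 8;          % match the number of available cores
> params.out_Dim      = 100;        % larger embedding dimension
> [result] = SLEEC(data, params);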

SLEEC Result

	result.clusterCenters     : cluster centers for the different learners
	result.tim_clus           : time taken for clustering
	result.SVPModel           : model for different learners containing embeddings and regressors
	result.SVPtime_mat        : time taken for performing SVP for each learner
	result.regressiontime_mat : time taken for learning regressors
	result.precision          : overall precision accuracy
	result.predictAcc         : precision accuracy per test point
	result.predictLabels      : Top-k labels predicted per test point
	result.tim_test           : time taken for the testing procedure
	result.test_KNN           : kNN Matrix for the test points
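
Once training and prediction finish, the result fields can be inspected directly at the Matlab prompt, for example (assuming test points index the rows of predictLabels):

> result.precision                  % overall precision accuracy on the test set
> result.predictLabels(1:5, :)      % top-k labels predicted for the first five test points
> result.tim_test                   % time taken for the testing procedure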

Toy Example

The zip file containing the source code also includes the BibTeX dataset as a toy example. The following instructions run SLEEC on the BibTeX dataset:

> cd Toy_Example
> load bibtex.mat
> bibtexParams
> cd ..
> [result] = SLEEC(data, params);

To verify that you have run the code successfully, please compare the precision you obtain against the precision stored in the result structure provided in the Toy_Example folder.
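
A quick way to perform this check is shown below; the file and variable names of the provided reference result are assumptions and may differ in your copy of the zip file.

> ref = load('Toy_Example/result.mat');                      % hypothetical name of the provided result file
> fprintf('Obtained precision : %f\n', result.precision);
> fprintf('Reference precision: %f\n', ref.result.precision);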

References

[01]     K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain, Sparse Local Embeddings for Extreme Multi-label Classification, in NIPS, 2015.
