There has been an explosion in the size of modern-day training sets with the advent of big data, cheap crowdsourcing and other techniques for gathering labelled information efficiently. This presents a significant challenge for non-linear SVMs, since their prediction cost can grow linearly with the size of the training set. Thus, even though non-linear SVMs have defined the state-of-the-art on multiple benchmark tasks, their use in real-world applications remains limited.
We develop a Local Deep Kernel Learning (LDKL) technique [1] for efficient non-linear SVM prediction while maintaining classification accuracy above an acceptable threshold. LDKL learns a tree-based primal feature embedding which is high dimensional and sparse. Primal-based classification decouples prediction costs from the number of support vectors and the size of the training set, and LDKL's tree-structured features efficiently encode non-linearities while speeding up prediction exponentially over the state-of-the-art. We develop routines for optimizing over the space of tree-structured features and efficiently scale to problems with more than half a million training points. Experiments on benchmark data sets reveal that LDKL can reduce prediction costs by more than three orders of magnitude in some cases with a moderate sacrifice in classification accuracy as compared to RBF-SVMs. Furthermore, LDKL can achieve better classification accuracies than leading methods for speeding up non-linear SVM prediction. In particular, LDKL is significantly better than kernel approximation techniques, such as Random Fourier Features [2,3] and Nyström [4,5], as it focuses on the decision boundary rather than modeling the kernel everywhere in space. LDKL can also be much faster than reduced set methods [6,7,8] as its tree-structured features can be computed in logarithmic time.
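To make the logarithmic prediction cost concrete, here is a minimal C++ sketch of how a tree-structured primal embedding can be evaluated by walking a single root-to-leaf path. It follows the general form of LDKL's decision function (a tanh-gated local linear classifier at each node on the path), but the struct layout, the names (`LDKLModel`, `ldklPredict`, `thetaRoute`, `thetaGate`) and the indexing scheme are illustrative assumptions, not the released code's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch only: a complete binary tree of depth D stored in
// breadth-first order, so node l has children 2l+1 and 2l+2. All names and
// the exact gating form are assumptions, not the released code's API.
struct LDKLModel {
    int depth;                                  // tree depth D; #nodes M = 2^(D+1) - 1
    float sigma;                                // sharpness of the tanh gating
    std::vector<std::vector<float>> w;          // per-node local linear classifiers
    std::vector<std::vector<float>> thetaRoute; // per-node routing hyperplanes
    std::vector<std::vector<float>> thetaGate;  // per-node gating hyperplanes
};

static float dot(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Only the D+1 nodes on the root-to-leaf path selected by
// sign(thetaRoute_l . x) contribute to the score, so prediction costs
// O(d log M) rather than the O(d * #support vectors) of a kernelised SVM.
float ldklPredict(const LDKLModel& m, const std::vector<float>& x) {
    float score = 0.0f;
    int l = 0; // start at the root
    for (int level = 0; level <= m.depth; ++level) {
        score += std::tanh(m.sigma * dot(m.thetaGate[l], x)) * dot(m.w[l], x);
        l = (dot(m.thetaRoute[l], x) > 0.0f) ? 2 * l + 1 : 2 * l + 2; // route down
    }
    return score; // the predicted label is sign(score)
}
```

The point of the sketch is the loop bound: the score depends only on the O(log M) nodes along one path, which is what decouples prediction cost from the number of support vectors and makes the speedups reported below possible.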
| Data Set | Linear SVM Accuracy (%) | RBF-SVM Accuracy (%) | LDKL Accuracy (%) | RBF-SVM Training Time (min) | LDKL Training Time (min) | RBF-SVM Prediction Time (normalized) | LDKL Prediction Time (normalized) | LDKL Prediction Speedup over RBF-SVM |
|---|---|---|---|---|---|---|---|---|
| Banana | 59.18 | 90.30 | 89.53 | 0.00067 | 0.01017 | 1481.81 | 32.12 | 46.13 |
| CIFAR | 69.16 | 81.62 | 75.89 | 1283.68 | 0.278 | 38094.48 | 5.37 | 7081.60 |
| CoverType | 76.27 | 91.87 | 90.03 | 1369.96 | 7.99 | 229418 | 16.67 | 13757.95 |
| IJCNN | 92.20 | 98.68 | 98.31 | 0.45 | 0.09 | 6684.53 | 12.24 | 546.00 |
| Letter | 73.08 | 98.10 | 96.02 | 0.43 | 2.2 | 11020 | 33.29 | 330.98 |
| Magic04 | 79.08 | 86.65 | 86.19 | 0.17 | 0.047 | 9489.79 | 13.26 | 709.92 |
| MNIST | 87.90 | 97.45 | 97.28 | 39.12 | 1.376 | 12130.98 | 9.37 | 1294.42 |
| USPS | 83.65 | 96.96 | 95.63 | 0.75 | 0.09 | 3375 | 8.15 | 413.79 |
| RCV1 | 93.18 | - | 95.27 | - | 0.5 | - | 1.18 | - |
| MNIST 8Million | 79.24 | - | 98.01 | - | 65.21 | - | 1.16 | - |

Note that the actual prediction times have been normalized by dividing by a linear SVM's prediction time. This gives a sense of how much slower a method is as compared to a linear SVM and is somewhat robust to the actual hardware configuration on which the predictions were carried out.
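To make the normalization concrete: on CoverType, an RBF-SVM prediction takes 229,418 times as long as a linear SVM's while an LDKL prediction takes only 16.67 times as long, and the ratio of the two gives the reported speedup of roughly 13,758×.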
Figure 1: Plots of prediction time versus prediction accuracy for state-of-the-art methods for speeding up non-linear SVM prediction. LDKL can have significantly higher classification accuracies as compared to the other methods for a given prediction cost. A method's curve not appearing in the plot for a given data set indicates that, over the given prediction cost range, the method's performance was worse than that of a linear SVM.
LDKL's classification accuracies are, by and large, within a percent or two of the RBF-SVM's while its prediction time can be two to four orders of magnitude lower. All our experiments were carried out on a single core of a 2.67 GHz Intel Xeon processor with 8GB of RAM.
Download the source code and Windows/Linux binaries
Please make sure that you have read the license agreement in LICENSE.doc/pdf before installing or using LDKL.
The code is in C++ and should compile on 32/64 bit Windows/Linux machines. The code also contains a Matlab interface. Please set "#define WINDOWS 1" in "include/Timer.h" when compiling on Windows machines and set the system PATH to include CL.EXE before running Makefile.bat (see the included README for detailed instructions). This code is made available as is for non-commercial research purposes. Please go through the EULA for more details.
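Concretely, the Windows switch mentioned above is a one-line edit (assumption: the flag should presumably be set to 0 on Linux, but see the included README for the authoritative instructions):

```cpp
// In include/Timer.h -- set to 1 when compiling on Windows
#define WINDOWS 1
```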
Download the data sets used in the LDKL ICML 2013 paper
Please note that these data sets have been preprocessed in some cases (binarized, made zero mean and unit variance, etc.) so they might not match those available on the UCI/LibSVM pages. In order to reproduce our results, please download and put all the data sets in the datasets folder. Compile LDKL, then locate and run the ReproduceResult script. The learnt models and results will be stored in the Models and Results folders.
Please contact Suraj Jain and Manik Varma if you have any questions or feedback.
Training: The command
> LDKLTrain [Options] TrainingDataFile [ModelFile]
reads in the training data from TrainingDataFile and saves the learnt LDKL model in ModelFile, subject to the options passed in [Options].
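For instance, to train on the included Banana data set with default options (the file names here are illustrative, not prescribed by the tool):

> LDKLTrain datasets/Banana.train Models/Banana.model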
Testing: The command
> LDKLPredict ModelFile TestDataFile [options]
applies the learnt LDKL classifier in ModelFile to the test data in TestDataFile and saves the predictions, subject to the options passed in [options].
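For instance, using the model trained above (again with illustrative file names):

> LDKLPredict Models/Banana.model datasets/Banana.test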
A toy example has been included with the source code and binaries. Unpack the code and try
> ToyExample
after building the source or copying the binaries from the /bin/OS/ folder to the base folder, where OS is win32, win64 or linux. This trains the LDKL classifier on the Banana data set. The learnt models will be saved in the Models folder and the results will be saved in the Results folder. You should expect an accuracy close to 89.2%.
Figure 2: Decision boundaries on the Banana data set: linear SVM (59.18%), RBF-SVM (90.65%), LDKL (89.4%), fully grown decision tree (87.23%), pruned decision tree (84.23%) and multi-feature decision tree (88.32%).
Note that LDKL's decision boundary and classification accuracy on this data set are similar to those of the RBF-SVM. The decision tree results are not as good, since typical decision trees are limited to piecewise linear decision boundaries.