An R Introduction to Statistics

Support Vector Machine with GPU

Most elementary statistical inference algorithms assume that the data can be modeled by linear parameters with a normally distributed error component. According to Vladimir Vapnik in Statistical Learning Theory (1998), this assumption is inappropriate for modern large-scale problems, and his invention of the Support Vector Machine (SVM) makes the assumption unnecessary. There are many implementations of the algorithm; a popular one is LIBSVM, which can be invoked in R via the e1071 package.

For demonstration purposes, we will train a regression model on the California housing prices data from the 1990 Census. The data set is called cadata and can be downloaded from the LIBSVM site.
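
If the LIBSVM tools page still hosts the file among its regression data sets (the exact URL below is an assumption and may change), it can be fetched directly from R:

> # download cadata in LIBSVM format; the URL is an assumption 
> # and may need adjusting if the hosting path has changed 
> download.file( 
+     "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cadata", 
+     destfile="cadata")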

Before training the SVM model, we pre-scale the downloaded data in a terminal using the standalone tool rpusvm-scale from RPUSVM. This creates a new data set, cadata.scaled. As good practice, we also save the scaling parameters in a secondary file, cadata.save, for later use.

$ rpusvm-scale -x "-1:1" -y "-1:1" -s cadata.save cadata cadata.scaled
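
The scaling itself is an ordinary min-max transformation onto the requested interval. Here is a minimal R sketch of what the -x "-1:1" option does to a single feature column, assuming rpusvm-scale follows the usual LIBSVM svm-scale convention:

> # map a numeric vector onto [lower, upper] by min-max scaling, 
> # mirroring the usual LIBSVM svm-scale convention 
> scale.to.range <- function(v, lower=-1, upper=1) 
+     lower + (upper - lower) * (v - min(v)) / (max(v) - min(v)) 
> range(scale.to.range(c(2, 5, 11)))    # the extremes map to -1 and 1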

Now we can load cadata.scaled into R with the function read.svm.data in the rpudplus add-on. Since the response values in the data set are not factor levels, we have to set the argument fac to FALSE. We also save the x and y components as standalone variables for convenience.

> library(rpud)                     # load rpudplus 
> cadata.scaled <- read.svm.data("cadata.scaled", fac=FALSE) 
> x <- cadata.scaled$x; y <- cadata.scaled$y
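
As a quick sanity check, the dimensions should match the standard cadata set of 20,640 observations with 8 features (assuming read.svm.data returns the design matrix in the x component):

> dim(x)       # expect 20640 rows, 8 columns 
> length(y)    # one response per observation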

Then we train an SVM regression model using the function svm in e1071. As the data has been pre-scaled, we disable the scale option. The data set has about 20,000 observations, and the training takes over a minute on an AMD Phenom II X4 system.

> library(e1071) 
> system.time(cadata.libsvm <- 
+   e1071::svm(x, y, type="eps-regression", scale=FALSE)) 
   user  system elapsed 
 64.630   0.010  64.659
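
The call above relies on the e1071 defaults: a radial (Gaussian) kernel with gamma equal to one over the number of features, cost 1, and epsilon 0.1. Writing them out makes later tuning easier; this sketch should be equivalent to the call above:

> # the same model with the e1071 defaults spelled out 
> cadata.libsvm <- e1071::svm(x, y, type="eps-regression", scale=FALSE, 
+     kernel="radial", gamma=1/ncol(x), cost=1, epsilon=0.1)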

We can do likewise with the function rpusvm of the rpudplus add-on. The same training takes only about 6 seconds on an NVIDIA GTX 460 GPU:

> system.time(cadata.rpusvm <- 
+   rpusvm(x, y, type="eps-regression", scale=FALSE)) 
.........**
   user  system elapsed 
  6.170   0.020   6.184

The models trained by the two packages are numerically equivalent, as evidenced by their respective mean squared errors. For the LIBSVM model from e1071, the mean squared error is about 0.0696.

> res.libsvm <- cadata.libsvm$residuals 
> sum(res.libsvm*res.libsvm)/length(res.libsvm) 
[1] 0.069568

This is almost identical to the mean squared error from the function rpusvm in rpudplus:

> res.rpusvm <- cadata.rpusvm$residuals 
> sum(res.rpusvm*res.rpusvm)/length(res.rpusvm) 
[1] 0.069566
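
The two fits can also be compared pointwise. A small maximum absolute difference between the fitted values would confirm the agreement; whether the rpusvm object supports fitted() the way an e1071 object does is an assumption here:

> # mean() is shorthand for the sum-and-divide above 
> mean(res.libsvm^2); mean(res.rpusvm^2) 
> # largest pointwise disagreement between the two models; 
> # fitted() on the rpusvm object is an assumption 
> max(abs(fitted(cadata.libsvm) - fitted(cadata.rpusvm)))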

Sometimes it is more effective to invoke LIBSVM directly in a terminal. In the command below, -s 3 selects epsilon-SVR and -m 1000 allocates a 1000MB kernel cache. Using OpenMP to parallelize LIBSVM v3.1 on an AMD Phenom II X4 CPU, training a regression model on cadata takes about 28 seconds.

$ time svm-train -s 3 -m 1000 cadata.scaled cadata.libsvm 
........ 
 
optimization finished, #iter = 9677 
nu = 0.590304 
obj = -2232.720805, rho = -0.299943 
nSV = 12216, nBSV = 12156 
 
real    0m28.633s 
user    0m28.190s 
sys     0m0.390s

Using a standalone Linux tool in RPUSVM, we can invoke the same rpusvm code in a terminal. The training takes about 5 seconds on the GTX 460 GPU:

$ time rpusvm-train -s 3 cadata.scaled cadata.rpusvm 
rpusvm-train 0.1.2 
http://www.r-tutor.com 
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved. 
This software is free for academic use only. There is absolutely NO warranty. 
 
GeForce GTX 460 GPU 
 
.........**
 
Finished optimization in 9498 iterations 
nu = 0.590179 
obj = -2232.72, rho = -0.300649 
nSV = 12218, nBSV = 12157 
Total nSV = 12218 
 
real    0m5.100s 
user    0m4.940s 
sys     0m0.150s

Finally, we compare their prediction speeds on cadata. The parallelized LIBSVM takes about 11 seconds on the Phenom II X4:

$ time svm-predict cadata.scaled cadata.libsvm cadata.out 
Mean squared error = 8500.66 (regression) 
Squared correlation coefficient = 0.000325578 (regression) 
 
real    0m11.176s 
user    0m44.540s 
sys     0m0.010s

The same task takes RPUSVM under 2 seconds on the GTX 460:

$ time rpusvm-predict cadata.scaled cadata.rpusvm cadata.out 
rpusvm-predict 0.1.2 
http://www.r-tutor.com 
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved. 
This software is free for academic use only. There is absolutely NO warranty. 
 
GeForce GTX 460 GPU 
 
Mean squared error = 0.0695664 
Pearson correlation coefficient = 0.698953 
 
real    0m1.631s 
user    0m1.440s 
sys     0m0.170s

Exercise 1

Train SVM models on larger data sets using rpusvm.

Exercise 2

Find probability estimates of the regression model of cadata by enabling the probability option in rpusvm.
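
A starting point, assuming rpusvm mirrors the probability argument of e1071::svm (for regression, LIBSVM estimates the scale of a Laplace distribution on the residuals):

> # probability=TRUE mirroring the e1071 interface is an assumption 
> cadata.prob <- rpusvm(x, y, type="eps-regression", scale=FALSE, 
+     probability=TRUE)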

Exercise 3

Perform cross-validation for the regression model of cadata by enabling the cross option in rpusvm.
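
A starting point, assuming rpusvm mirrors the cross argument of e1071::svm, where the cross-validated mean squared error is stored on the fitted object:

> # cross=10 mirroring the e1071 interface is an assumption 
> cadata.cv <- rpusvm(x, y, type="eps-regression", scale=FALSE, cross=10) 
> cadata.cv$tot.MSE    # 10-fold cross-validated MSE, as in e1071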

Exercise 4

Search for an optimal SVM kernel and parameters for the regression model of cadata using rpusvm, following procedures similar to those explained in the text A Practical Guide to Support Vector Classification. In particular, create a contour map like the one below for selecting smaller regions for further optimization; a grid-search sketch follows the figure.

[Figure: contour map of cross-validated error over the SVM parameter grid]
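
A minimal grid-search sketch, written with e1071::svm for definiteness (rpusvm should accept the same arguments); the parameter ranges are illustrative assumptions:

> # coarse grid over the radial-kernel gamma and the cost C; 
> # cv.mse[i, j] holds the 5-fold cross-validated MSE 
> gammas <- 2^seq(-7, 3, by=2) 
> costs  <- 2^seq(-3, 7, by=2) 
> cv.mse <- matrix(NA, length(gammas), length(costs)) 
> for (i in seq_along(gammas)) { 
+     for (j in seq_along(costs)) { 
+         fit <- e1071::svm(x, y, type="eps-regression", scale=FALSE, 
+             gamma=gammas[i], cost=costs[j], cross=5) 
+         cv.mse[i, j] <- fit$tot.MSE 
+     } 
+ } 
> contour(log2(gammas), log2(costs), cv.mse, 
+     xlab="log2(gamma)", ylab="log2(cost)")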

Note 1

Suppose we would like to perform prediction on a data file stored in LIBSVM format, say test.dat. We must first pre-scale it with the scale parameter file cadata.save, which we created earlier when preparing cadata for training.

$ rpusvm-scale -r cadata.save test.dat test.scaled

Then we load it into R with the read.svm.data method in rpudplus and apply the function predict as usual. Just make sure to manually restore the result to the original y-scale before use; a sketch of the inverse transformation follows the code below.

> test.scaled <- read.svm.data("test.scaled", fac=FALSE) 
> pred <- predict(cadata.rpusvm, test.scaled$x) 
> head(pred) 
        1         2         3         4         5 
 0.592945  0.782728  0.557078  0.299014 -0.053179
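
Since the response was mapped onto [-1, 1] during training, the inverse min-max transformation recovers the original scale. The minimum and maximum below are hypothetical placeholders; the actual values are the ones recorded in cadata.save:

> # invert the [-1, 1] min-max scaling of the response; 
> # y.min and y.max are hypothetical placeholders for the 
> # values stored in cadata.save 
> y.min <- 14999; y.max <- 500001 
> pred.orig <- y.min + (pred + 1)/2 * (y.max - y.min)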

These data-scaling chores can be avoided with the latest rpudplus and RPUSVM. See our next tutorial for details.

Note 2

A much faster algorithm for large-scale document classification without the use of a GPU is LIBLINEAR. It can process millions of records in seconds.
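
For instance, via the LiblineaR package on CRAN, which wraps LIBLINEAR (a minimal sketch on toy data; the model type and data set are illustrative):

> # L2-regularized L2-loss support vector classification (type=2) 
> library(LiblineaR) 
> data(iris) 
> m <- LiblineaR(data=iris[, 1:4], target=iris$Species, type=2) 
> table(predict(m, iris[, 1:4])$predictions, iris$Species)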

References

  1. Christopher J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2 (1998) 121-167.
  2. Chih-Chung Chang and Chih-Jen Lin: LIBSVM - A Library for Support Vector Machines. www.csie.ntu.edu.tw/~cjlin/libsvm/