Support Vector Machine with GPU
Most elementary statistical inference algorithms assume that the data can be modeled by linear parameters with a normally distributed error component. According to Vladimir Vapnik in Statistical Learning Theory (1998), this assumption is inappropriate for modern large-scale problems, and his invention of the Support Vector Machine (SVM) makes such assumptions unnecessary. There are many implementations of the algorithm, and a popular one is LIBSVM, which can be invoked in R via the e1071 package.
For demonstration purposes, we will train a regression model on the California housing prices data from the 1990 Census. The data set is called cadata and can be downloaded from the LIBSVM site.
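The file can be fetched from the LIBSVM data page; the URL below is its usual location, though it may change:

$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cadata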
Before training the SVM model, we pre-scale the downloaded data in a terminal using rpusvm-scale, a standalone tool in RPUSVM. This creates a new data set, cadata.scaled. As good practice, we also save the scale parameters in a secondary file, cadata.save, for later use.
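A plausible invocation is sketched below, assuming rpusvm-scale follows the interface of LIBSVM's svm-scale, where -y scales the target values, -s saves the scale parameters, and the scaled data is written to standard output. The exact flags are an assumption, not confirmed from the RPUSVM documentation:

# assumption: rpusvm-scale mirrors LIBSVM's svm-scale flags
$ rpusvm-scale -y -1 1 -s cadata.save cadata > cadata.scaled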
Now we can load cadata.scaled into R with the function read.svm.data in the rpudplus add-on. Since the response values in the data set are not factor levels, we set the argument fac to FALSE. We also save the x and y components as standalone variables for convenience.
> cadata.scaled <- read.svm.data("cadata.scaled", fac=FALSE)
> x <- cadata.scaled$x; y <- cadata.scaled$y
Then we train an SVM regression model using the function svm in e1071. As the data has been pre-scaled, we disable the scale option. The data set has about 20,000 observations, and the training takes over a minute on an AMD Phenom II X4 system.
> system.time(cadata.libsvm <-
+ e1071::svm(x, y, type="eps-regression", scale=FALSE))
user system elapsed
64.630 0.010 64.659
We can do likewise with the function rpusvm of the rpudplus add-on. The same training now takes only 6 seconds on an NVIDIA GTX 460 GPU:
> system.time(cadata.rpusvm <-
+ rpusvm(x, y, type="eps-regression", scale=FALSE))
.........**.
user system elapsed
6.170 0.020 6.184
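Before comparing accuracy, we extract the residual vectors from the two fitted models. e1071's svm object stores them in its residuals component; we assume here that the object returned by rpusvm exposes the same component:

> res.libsvm <- cadata.libsvm$residuals
> res.rpusvm <- cadata.rpusvm$residuals  # assumption: rpusvm mirrors e1071's interface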
The models trained by the two packages are numerically equivalent, as evidenced by their respective mean squared errors. For the LIBSVM model from e1071, the mean squared error is about 0.0696.
> sum(res.libsvm*res.libsvm)/length(res.libsvm)
[1] 0.069568
This is almost identical to the mean squared error from the function rpusvm in rpudplus:
> sum(res.rpusvm*res.rpusvm)/length(res.rpusvm)
[1] 0.069566
Sometimes it is more effective to invoke LIBSVM directly in a terminal. Using OpenMP to parallelize LIBSVM v3.1 on an AMD Phenom II X4 CPU, training a regression model on cadata takes about 28 seconds.
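A typical invocation is shown below; -s 3 selects epsilon-SVR, and the RBF kernel is the default. Note that stock LIBSVM is single-threaded; OpenMP parallelization requires rebuilding it with -fopenmp as described in the LIBSVM FAQ:

$ time ./svm-train -s 3 cadata.scaled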
........
optimization finished, #iter = 9677
nu = 0.590304
obj = -2232.720805, rho = -0.299943
nSV = 12216, nBSV = 12156
real 0m28.633s
user 0m28.190s
sys 0m0.390s
RPUSVM also provides a standalone Linux tool, rpusvm-train, which invokes the same code as rpusvm from a terminal. The training takes about 5 seconds on a GTX 460 GPU:
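# assumption: rpusvm-train accepts svm-train style options
$ time rpusvm-train -s 3 cadata.scaled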
rpusvm-train 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
.........**.
Finished optimization in 9498 iterations
nu = 0.590179
obj = -2232.72, rho = -0.300649
nSV = 12218, nBSV = 12157
Total nSV = 12218
real 0m5.100s
user 0m4.940s
sys 0m0.150s
Finally, we compare their prediction speeds on cadata. Parallelized LIBSVM takes about 11 seconds on a Phenom II X4:
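svm-predict takes a test file, a model file, and an output file; svm-train names the model file after its input by default, so the model here is cadata.scaled.model. The test and output files are left as placeholders:

$ time ./svm-predict <test_file> cadata.scaled.model <output_file>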
Mean squared error = 8500.66 (regression)
Squared correlation coefficient = 0.000325578 (regression)
real 0m11.176s
user 0m44.540s
sys 0m0.010s
The same task takes RPUSVM under 2 seconds on a GTX 460:
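# assumption: rpusvm-predict mirrors svm-predict's usage
$ time rpusvm-predict <test_file> <model_file> <output_file>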
rpusvm-predict 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
Mean squared error = 0.0695664
Pearson correlation coefficient = 0.698953
real 0m1.631s
user 0m1.440s
sys 0m0.170s
Exercise 1
Train SVM models of larger data sets using rpusvm.
Exercise 2
Find probability estimates of the regression model of cadata by enabling the probability option in rpusvm.
Exercise 3
Perform cross-validation for the regression model of cadata by enabling the cross option in rpusvm.
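As a starting point for Exercises 2 and 3, and assuming rpusvm accepts probability and cross arguments in the same form as e1071's svm:

> # probability estimates (Exercise 2)
> cadata.prob <- rpusvm(x, y, type="eps-regression", probability=TRUE, scale=FALSE)
> # 10-fold cross-validation (Exercise 3)
> cadata.cv <- rpusvm(x, y, type="eps-regression", cross=10, scale=FALSE)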
Exercise 4
Search for an optimal SVM kernel and parameters for the regression model of cadata using rpusvm, following procedures similar to those explained in the text A Practical Guide to Support Vector Classification. In particular, create a contour map of the cross-validated error over the parameter grid for selecting smaller regions for further optimization.
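A hedged sketch of the grid search follows, using e1071's built-in cross-validation so it runs with the packages already loaded (rpusvm's cross option should behave analogously). The grid below is coarser than the Practical Guide's recommended ranges, and tot.MSE is assumed to be the cross-validated mean squared error that e1071's svm reports for regression:

> # coarse grid in log2 space; the Practical Guide suggests wider ranges
> C.exp <- seq(-1, 9, by=2)
> g.exp <- seq(-9, 1, by=2)
> cv.mse <- matrix(NA, length(C.exp), length(g.exp))
> for (i in seq_along(C.exp)) {
+     for (j in seq_along(g.exp)) {
+         m <- e1071::svm(x, y, type="eps-regression", scale=FALSE,
+             cost=2^C.exp[i], gamma=2^g.exp[j], cross=5)
+         cv.mse[i, j] <- m$tot.MSE
+     }
+ }
> contour(C.exp, g.exp, cv.mse, xlab="log2(C)", ylab="log2(gamma)")

Each grid point trains five models on the full data set, so expect the search to take a while; this is exactly where the GPU pays off.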
Note 1
Suppose we would like to perform prediction on a data file stored in LIBSVM format, say test.dat. We must first pre-scale it with the scale parameter file cadata.save, which we created earlier when preparing cadata for training.
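Again assuming rpusvm-scale mirrors svm-scale, where -r restores previously saved scale parameters:

# assumption: -r restores scale parameters, as in LIBSVM's svm-scale
$ rpusvm-scale -r cadata.save test.dat > test.scaled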
Then we load it into R with the function read.svm.data in rpudplus and apply the function predict as usual. Just make sure to restore the result to the original y-scale before use.
> test.scaled <- read.svm.data("test.scaled", fac=FALSE)
> pred <- predict(cadata.rpusvm, test.scaled$x)
> head(pred)
1 2 3 4 5
0.592945 0.782728 0.557078 0.299014 -0.053179
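The values above are on the scaled y-scale. A minimal sketch of the restoration, assuming y was scaled linearly to an interval [y.lower, y.upper] and that ymin, ymax, y.lower, and y.upper hold the values recorded in cadata.save (hypothetical variable names for illustration):

> # invert the linear map: scaled = y.lower + (orig - ymin)*(y.upper - y.lower)/(ymax - ymin)
> pred.orig <- ymin + (pred - y.lower) * (ymax - ymin) / (y.upper - y.lower)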
These data-scaling chores can be avoided with the latest rpudplus and RPUSVM. See our next tutorial for details.
Note 2
A much faster algorithm for large-scale document classification without the use of a GPU is LIBLINEAR. It can process millions of records in seconds.
References
- Christopher J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2 (1998) 121-167
- Chih-Chung Chang and Chih-Jen Lin: LIBSVM - A Library for Support Vector Machines. www.csie.ntu.edu.tw/~cjlin/libsvm/