Support Vector Machine with GPU
Most elementary statistical inference algorithms assume that the data can be modeled by linear parameters with a normally distributed error component. According to Vladimir Vapnik in Statistical Learning Theory (1998), this assumption is inappropriate for modern large-scale problems, and his invention of the Support Vector Machine (SVM) makes such assumptions unnecessary. There are many implementations of the algorithm, and a popular one is LIBSVM, which can be invoked in R via the e1071 package.
For demonstration purposes, we will train a regression model on the California housing prices data from the 1990 Census. The data set is called cadata and can be downloaded from the LIBSVM site.
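The file can be fetched from the LIBSVM data page; the URL below is its usual location, though it may change:

$ wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/cadata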
Before training the SVM model, we pre-scale the downloaded data in a terminal using rpusvm-scale, a standalone tool in RPUSVM. This creates a new data set, cadata.scaled. As good practice, we also save the scale parameters in a secondary file, cadata.save, for later use.
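A plausible invocation is sketched below, assuming rpusvm-scale follows the interface of LIBSVM's svm-scale, where -y scales the target values, -s saves the scale parameters, and the scaled data is written to standard output. The exact flags are an assumption, not confirmed from the RPUSVM documentation:

# assumption: rpusvm-scale mirrors LIBSVM's svm-scale flags
$ rpusvm-scale -y -1 1 -s cadata.save cadata > cadata.scaled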
Now we can load cadata.scaled into R with the function read.svm.data in the rpudplus add-on. Since the response values in the data set are not factor levels, we set the argument fac to FALSE. We also save the x and y components as standalone variables for convenience.
> cadata.scaled <- read.svm.data("cadata.scaled", fac=FALSE)
> x <- cadata.scaled$x; y <- cadata.scaled$y
Then we train an SVM regression model using the function svm in e1071. As the data has been pre-scaled, we disable the scale option. The data set has about 20,000 observations, and the training takes over a minute on an AMD Phenom II X4 system.
> system.time(cadata.libsvm <-
+ e1071::svm(x, y, type="eps-regression", scale=FALSE))
user system elapsed
64.630 0.010 64.659
We can do likewise with the function rpusvm of the rpudplus add-on. The same training now takes only 6 seconds on an NVIDIA GTX 460 GPU:
> system.time(cadata.rpusvm <-
+ rpusvm(x, y, type="eps-regression", scale=FALSE))
.........**.
user system elapsed
6.170 0.020 6.184
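Before comparing accuracy, we extract the residual vectors from the two fitted models. e1071's svm object stores them in its residuals component; we assume here that the object returned by rpusvm exposes the same component:

> res.libsvm <- cadata.libsvm$residuals
> res.rpusvm <- cadata.rpusvm$residuals  # assumption: rpusvm mirrors e1071's interface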
The models trained by the two packages are numerically equivalent, as evidenced by their respective mean squared errors. For the LIBSVM model from e1071, the mean squared error is about 0.0696.
> sum(res.libsvm*res.libsvm)/length(res.libsvm)
[1] 0.069568
This is almost identical to the mean squared error from the function rpusvm in rpudplus:
> sum(res.rpusvm*res.rpusvm)/length(res.rpusvm)
[1] 0.069566
Sometimes it is more effective to invoke LIBSVM directly in a terminal. Using OpenMP to parallelize LIBSVM v3.1 on an AMD Phenom II X4 CPU, training a regression model on cadata takes about 28 seconds.
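A typical invocation is shown below; -s 3 selects epsilon-SVR, and the RBF kernel is the default. Note that stock LIBSVM is single-threaded; OpenMP parallelization requires rebuilding it with -fopenmp as described in the LIBSVM FAQ:

$ time ./svm-train -s 3 cadata.scaled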
........
optimization finished, #iter = 9677
nu = 0.590304
obj = -2232.720805, rho = -0.299943
nSV = 12216, nBSV = 12156
real 0m28.633s
user 0m28.190s
sys 0m0.390s
RPUSVM also provides a standalone Linux tool, rpusvm-train, which invokes the same code as rpusvm from a terminal. The training takes about 5 seconds on a GTX 460 GPU:
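# assumption: rpusvm-train accepts svm-train style options
$ time rpusvm-train -s 3 cadata.scaled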
rpusvm-train 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
.........**.
Finished optimization in 9498 iterations
nu = 0.590179
obj = -2232.72, rho = -0.300649
nSV = 12218, nBSV = 12157
Total nSV = 12218
real 0m5.100s
user 0m4.940s
sys 0m0.150s
Finally, we compare their prediction speeds on cadata. Parallelized LIBSVM takes about 11 seconds on a Phenom II X4:
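svm-predict takes a test file, a model file, and an output file; svm-train names the model file after its input by default, so the model here is cadata.scaled.model. The test and output files are left as placeholders:

$ time ./svm-predict <test_file> cadata.scaled.model <output_file>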
Mean squared error = 8500.66 (regression)
Squared correlation coefficient = 0.000325578 (regression)
real 0m11.176s
user 0m44.540s
sys 0m0.010s
The same task takes RPUSVM under 2 seconds on a GTX 460:
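# assumption: rpusvm-predict mirrors svm-predict's usage
$ time rpusvm-predict <test_file> <model_file> <output_file>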
rpusvm-predict 0.1.2
http://www.r-tutor.com
Copyright (C) 2011-2012 Chi Yau. All Rights Reserved.
This software is free for academic use only. There is absolutely NO warranty.
GeForce GTX 460 GPU
Mean squared error = 0.0695664
Pearson correlation coefficient = 0.698953
real 0m1.631s
user 0m1.440s
sys 0m0.170s
Exercise 1
Train SVM models of larger data sets using rpusvm.
Exercise 2
Find probability estimates of the regression model of cadata by enabling the probability option in rpusvm.
Exercise 3
Perform cross-validation for the regression model of cadata by enabling the cross option in rpusvm.
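As a starting point for Exercises 2 and 3, and assuming rpusvm accepts probability and cross arguments in the same form as e1071's svm:

> # probability estimates (Exercise 2)
> cadata.prob <- rpusvm(x, y, type="eps-regression", probability=TRUE, scale=FALSE)
> # 10-fold cross-validation (Exercise 3)
> cadata.cv <- rpusvm(x, y, type="eps-regression", cross=10, scale=FALSE)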
Exercise 4
Search for an optimal SVM kernel and parameters for the regression model of cadata using rpusvm, following procedures similar to those explained in the text A Practical Guide to Support Vector Classification. In particular, create a contour map of the cross-validated error over the parameter grid for selecting smaller regions for further optimization.
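A hedged sketch of the grid search follows, using e1071's built-in cross-validation so it runs with the packages already loaded (rpusvm's cross option should behave analogously). The grid below is coarser than the Practical Guide's recommended ranges, and tot.MSE is assumed to be the cross-validated mean squared error that e1071's svm reports for regression:

> # coarse grid in log2 space; the Practical Guide suggests wider ranges
> C.exp <- seq(-1, 9, by=2)
> g.exp <- seq(-9, 1, by=2)
> cv.mse <- matrix(NA, length(C.exp), length(g.exp))
> for (i in seq_along(C.exp)) {
+     for (j in seq_along(g.exp)) {
+         m <- e1071::svm(x, y, type="eps-regression", scale=FALSE,
+             cost=2^C.exp[i], gamma=2^g.exp[j], cross=5)
+         cv.mse[i, j] <- m$tot.MSE
+     }
+ }
> contour(C.exp, g.exp, cv.mse, xlab="log2(C)", ylab="log2(gamma)")

Each grid point trains five models on the full data set, so expect the search to take a while; this is exactly where the GPU pays off.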
Note 1
Suppose we would like to perform prediction on a data file stored in LIBSVM format, say test.dat. We must first pre-scale it with the scale parameter file cadata.save, which we created earlier when preparing cadata for training.
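Again assuming rpusvm-scale mirrors svm-scale, where -r restores previously saved scale parameters:

# assumption: -r restores scale parameters, as in LIBSVM's svm-scale
$ rpusvm-scale -r cadata.save test.dat > test.scaled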
Then we load it into R with the function read.svm.data in rpudplus and apply the function predict as usual. Just make sure to restore the result to the original y-scale before use.
> test.scaled <- read.svm.data("test.scaled", fac=FALSE)
> pred <- predict(cadata.rpusvm, test.scaled$x)
> head(pred)
1 2 3 4 5
0.592945 0.782728 0.557078 0.299014 -0.053179
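The values above are on the scaled y-scale. A minimal sketch of the restoration, assuming y was scaled linearly to an interval [y.lower, y.upper] and that ymin, ymax, y.lower, and y.upper hold the values recorded in cadata.save (hypothetical variable names for illustration):

> # invert the linear map: scaled = y.lower + (orig - ymin)*(y.upper - y.lower)/(ymax - ymin)
> pred.orig <- ymin + (pred - y.lower) * (ymax - ymin) / (y.upper - y.lower)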
These data-scaling chores can be avoided with the latest rpudplus and RPUSVM. See our next tutorial for details.
Note 2
A much faster algorithm for large-scale document classification without the use of a GPU is LIBLINEAR. It can process millions of records in seconds.
References
- Christopher J. C. Burges: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2 (1998) 121-167
- Chih-Chung Chang and Chih-Jen Lin: LIBSVM - A Library for Support Vector Machines. www.csie.ntu.edu.tw/~cjlin/libsvm/