Jan 29, 2011

Getting started with LibSVM.

Intro

This page is a tutorial on how to use LibSVM to solve a simple, real-world problem from a bash shell.

Process

Download LibSVM.
Untar the archive and run make. There is no ./configure step.
Read Section A of this article. We are going to try the example in section A.1.
Download the training data here.
Scale the data: ./libsvm-3.0/svm-scale -l -1 -u 1 -s range1 svmguide1 > svmguide1.scale. This rescales each feature to lie between -1 and 1. The scaling parameters are saved to a file named range1, which will be used later.
Find the ideal parameter values: cd libsvm-3.0/tools/; python grid.py ../../svmguide1.scale; cd -. A plot will appear. This step takes significantly longer than the others, so just keep waiting. You have to cd to the tools directory for the Python script to run properly. I got the result 2.0 2.0 96.9893. The first value is c, the second is g, and the third is the cross-validation accuracy in percent.
Using the above parameters, create the model: ./libsvm-3.0/svm-train -c 2 -g 2 svmguide1.scale. There will now be a svmguide1.scale.model file in your current directory.
Download the test data here.
Scale the test data: ./libsvm-3.0/svm-scale -r range1 svmguide1.t > svmguide1.t.scale
Run the test: ./libsvm-3.0/svm-predict svmguide1.t.scale svmguide1.scale.model svmguide1.t.predict
You should see an accuracy of “96.875%”.
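
The scale, train, scale, predict steps above can be collected into a small Python script. This is only a sketch: the function names build_steps and run_steps are made up for illustration, and it assumes LibSVM was built in ./libsvm-3.0 and that svmguide1 and svmguide1.t are in the current directory. The grid search is left out, so c and g are passed in by hand.

```python
import subprocess


def build_steps(libsvm_dir="./libsvm-3.0", c=2, g=2):
    """Return (argv, stdout_path) pairs mirroring the manual steps.

    stdout_path is the file the command's output is redirected to,
    or None when the command writes its own output files.
    """
    scale = f"{libsvm_dir}/svm-scale"
    return [
        # Scale the training data and save the ranges to "range1".
        ([scale, "-l", "-1", "-u", "1", "-s", "range1", "svmguide1"],
         "svmguide1.scale"),
        # Train; this writes svmguide1.scale.model.
        ([f"{libsvm_dir}/svm-train", "-c", str(c), "-g", str(g),
          "svmguide1.scale"], None),
        # Scale the test data with the ranges saved from training.
        ([scale, "-r", "range1", "svmguide1.t"], "svmguide1.t.scale"),
        # Predict; this writes svmguide1.t.predict and prints the accuracy.
        ([f"{libsvm_dir}/svm-predict", "svmguide1.t.scale",
          "svmguide1.scale.model", "svmguide1.t.predict"], None),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


# To actually run the pipeline (requires the LibSVM binaries and data files):
# run_steps(build_steps())
```

Keeping the command list separate from the code that runs it makes it easy to print the commands first and check them against the steps above.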

Note: it would be easier to run tools/easy.py but this helps get an understanding of what it does underneath.

Data Files

The data files are plain text with one example per line. Each line starts with the output classification, followed by space-delimited index:value pairs. A sample line looks like this:

0 1:6.640930e-01 2:2.540829e+01 3:-1.155107e-01 4:4.739574e+01

The above line shows an example from category “0”. The first feature is 0.6640930, written as the feature number, a colon, and the value in scientific notation. The value does not have to be in scientific notation and usually is not after scaling.
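
To make the format concrete, here is a minimal Python sketch that splits one such line into its label and features. The function name parse_line is hypothetical and is not part of LibSVM, and the sketch ignores blank or malformed lines.

```python
def parse_line(line):
    """Split one LibSVM-format line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])  # the output classification comes first
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)  # float() accepts scientific notation
    return label, features


label, features = parse_line(
    "0 1:6.640930e-01 2:2.540829e+01 3:-1.155107e-01 4:4.739574e+01"
)
print(label, features)
```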

Note that both training and test data have the exact same format. An interesting test is to run the model on the training data. I tried it on the above and got 97.1188% accuracy. If the accuracy is bad on the training data, there are probably serious issues.

Scaled Data

The scaling information for the scaled files is not stored in the data files themselves. Instead, svm-scale keeps it in a separate range file: the -s parameter saves this file, and the -r parameter reads it back and applies the same scaling that was used for the training data. Here are the contents of the file produced by the above commands:

x
-1 1
1 0 297.05
2 -4.555206 581.0731
3 -0.7524385 0.7170606
4 8.157474000000001 180
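
As I read it, the second line holds the target lower and upper bounds, and each following line holds a feature number with that feature's minimum and maximum in the training data. The Python sketch below parses the file and applies plain linear min-max scaling. It assumes that is the mapping svm-scale uses; the names parse_range and scale_value are made up for illustration.

```python
RANGE_TEXT = """x
-1 1
1 0 297.05
2 -4.555206 581.0731
3 -0.7524385 0.7170606
4 8.157474000000001 180
"""


def parse_range(text):
    """Return (lower, upper, {index: (feature_min, feature_max)})."""
    lines = text.strip().splitlines()
    # lines[0] is the "x" marker; lines[1] holds the target bounds.
    lower, upper = map(float, lines[1].split())
    ranges = {}
    for line in lines[2:]:
        index, fmin, fmax = line.split()
        ranges[int(index)] = (float(fmin), float(fmax))
    return lower, upper, ranges


def scale_value(value, fmin, fmax, lower, upper):
    """Linearly map value from [fmin, fmax] to [lower, upper]."""
    return lower + (upper - lower) * (value - fmin) / (fmax - fmin)


lower, upper, ranges = parse_range(RANGE_TEXT)
fmin, fmax = ranges[1]
# Feature 1 of the sample line shown earlier on this page.
print(scale_value(0.664093, fmin, fmax, lower, upper))
```

Test data scaled with these saved ranges can land slightly outside -1 to 1 if it contains values beyond the training minimum or maximum.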

Details

The article we used as a basis is clear that you should always scale the data and optimize the parameters. In this case, parameter optimization didn’t add much accuracy, but other examples in that article show that it can make a huge difference.
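
To give a feel for what the parameter search covers, this Python sketch lists the (c, g) pairs on an exponential grid. The exponent ranges are an assumption: I believe they match the grid.py defaults (log2 c from -5 to 15 in steps of 2, log2 g from 3 to -15 in steps of -2), but check the script's usage message. The cross-validation that grid.py runs for each pair is not shown.

```python
def exponent_range(begin, end, step):
    """Inclusive range of exponents; step may be negative."""
    values = []
    x = begin
    while (step > 0 and x <= end) or (step < 0 and x >= end):
        values.append(x)
        x += step
    return values


# Assumed defaults; adjust to match your copy of grid.py.
log2c = exponent_range(-5, 15, 2)
log2g = exponent_range(3, -15, -2)

# Every combination of c = 2^i and g = 2^j is a candidate.
candidates = [(2.0 ** i, 2.0 ** j) for i in log2c for j in log2g]
print(len(candidates), "candidate (c, g) pairs")
```

Each candidate needs its own cross-validation run, which is why that step is so much slower than the others.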

Prediction

Training and prediction are invoked in much the same way. The predicted results are stored in yet another file, which contains just the predicted value for each line of input. I made my own set of test data here:

0 1:2.612900e+00 2:1.832880e+01 3:1.222401e-01 4:7.389243e+01
0 1:2.004300e+01 2:2.822780e+02 3:2.021259e-01 4:1.411875e+02

The first line was taken from a “0” and the second line was taken from a “1” in the sample data set. As expected, svm-predict gave an accuracy of 50%, and its output file has a 0 and a 1 on the expected lines. I get an error when I put a letter or a “?” as the first character of every line. If I put a space as the first character, I get a 100% accuracy result and the predict file shows all “1”s. Beware of this wrong answer.

The best strategy when predicting unlabeled data is to simply put a “0” label at the start of every line.
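
The accuracy that svm-predict reports is just the share of lines where the label in the input file matches the predicted label, so placeholder “0” labels make the figure meaningless while the predictions themselves are unaffected. A minimal Python sketch of that calculation, with a hypothetical accuracy helper and the two-line test set from above:

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of positions where the two label lists agree."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Both input lines were labeled 0; svm-predict answered 0 and 1.
print(accuracy([0, 0], [0, 1]))  # 50.0
```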