Teaching the Toaster to Feel Love [Part 2]

Welcome to part two of my series of posts on teaching a toaster to feel love, or introduction to machine learning for the rest of us.

If you missed part one, you can read Part 1.

Last time I thought in part two I would be able to start posting some code.  However after writing a good chunk of the code I wanted to include in part two, I realized it was way too much to cover in one post.  So I’m going use this post to introduce some function calls and classes, but focus mostly on the process.

Download

A fairly popular SVM library is LibSVM, which you can download on this page here, or directly here.  There are several variations of that library in many languages.  It’s what I will be referencing in this and in coming posts.

Features

The first and most important step when using an SVM is to figure out what constitutes a good feature vector.  If you remember from part one, a feature is just a floating point value with importance to the programmer but is just another number in a dimension for the SVM.  The feature vector is the collection of these features, constituting a single example.

The features that your feature vector is composed of is the real secret sauce, everything else in the process is more or less mechanical.  Sadly it also means I can’t tell you what to put into your feature vector because everyone’s problem needs a different vector.  However, there are lots of tactics that apply in all cases that you should remember.

Normalization

You need to remember to normalize your feature data.  I don’t mean just normalizing the feature vector like you would a float4. I mean, each feature should be normalized either from 0..1 or –1..1 (not both).  For two important reasons, numerical stability and weighting.

Numerical Stability – We don’t want to get into situations where we might be hitting up against the 32bit limit with very large numbers during the testing process.

Weighting – We also don’t want one dimension to greatly out weigh any other dimension accidentally.  Because an SVM is attempting to find the hyperplane with the greatest margin that divides the data, having one dimension in the feature vector run from 0..1000 while everything else runs from 0..1, would mean that almost all your data would be classified based on that one feature because it might find a hyperplane that gave it a margin of 500 in that one dimension while every other feature’s margin added together doesn’t even approach 500.  So that feature more than any of the others would be what determined classification.

Numericalization

Not all kinds of data that you may want to include in a feature vector exists naturally as a number.  Sometimes you need to transform a category into a feature.  Your first thought might be to divide the category number by the number of categories and voilà, a floating point number.  However, this is not the best approach.  The better approach is to represent each category like you would if you wanted to represent them as bit fields, and dedicate one dimension for each category.  So if there were 3 possible categories, you’d represent the first category with 3 features, 0, 0, 1, the second category as 0, 1, 0, and the third as 1, 0, 0.

Rephrasing

It’s important to remember that a machine doesn’t have insight.  If I gave you position and time data, you could tell me other information like velocity and acceleration.  However an SVM doesn’t know what the data is, so it can’t derive any sort of insightful observations about it.  So it’s important to remember to rephrase the same data in different ways that tell the machine new things about the same data.

Ordering

The order of the features does not influence the process, however you need to keep your ordering consistent.

Size

Your feature vectors should all have the same number of features.

Training

The second step in the SVM process is training the SVM model using the feature vectors you’ve built containing all the useful features you think will help distinguish one thing from another thing.

To train an SVM you need 2 sets of data; the training set and the validation set.  While you can use just a training dataset, it’s not advisable as it can lead to a problem known as ‘over fitting’.  Which is just a fancy way of saying, you’ve created an SVM that ONLY recognizes the training data and doesn’t handle variation very well.

The most common and easiest training methodology to use for beginners with an SVM is known as a grid search.  The process is pretty simple, to determine the best support vectors we’ll iterate over a series of parameter magic number combinations (Cost (C) and Gamma) until we find a pair that seeds the process of dividing of our data the best.  The kernel type we’ll be using to divide our data will be the Radial Basis Function (RBF) kernel.  You could devise any kind of kernel you wanted to divide the data, however there’s a bunch of existing kernel functions people tend to re-use because they work in a wide number of situations.

The Cost (C) parameter represents the penalty for miss-classification in the training process.  The higher the cost, the more rigid/strict the trained model will be for the training data.  Meaning that while you may see fewer miss classified items (for data that resembles closely the training set), the model will be much more likely ‘over fit’ the problem.

Gamma is a parameter that’s used directly in the RBF kernel function.

The first step in the training process is to create an svmproblem object.  It holds all the training data in the form of svmnodes.  Then we define our svmparameter, which represents the kernel type and all the other parameters used in the training and creation process of our svmmodel.

To determine the best parameters in the grid search we’ll use the svmcrossvalidation method that’s built into SVMLib.  Which allows us to take the ground truth training data and see how well it would be classified given the resulting model from the selected Gamma and C parameters.

Now, you don’t have to vary JUST the C and Gamma parameters you could try different kernel functions.  But the more things you vary in the grid search the longer the training process is going to take.  Which, depending on the size of your dataset may be hours.  It all depends on how many different combinations of parameters you are willing to try.

Side Note: Should you ever run into a training time problem, I highly recommend looking into ways to running the training process in distributed environment.  The training process for an SVM is VERY amenable to being parallelized.  There are also some GPU (CUDA) implementations of the SVM that I’ve read about but not yet tried.  You can always use one SVM implementation for finding the best parameters and another for runtime usage.

Once you’ve determined the best C and Gamma based on the miss-classifications from your svmcrossvalidation grid search process.  You’ll just need to call svm_train to generate the final model that you can save out and re-use at runtime to classify datasets that you want the SVM to identify for you.

Once you have your model you’ll want to run your validation dataset against it to see how well it runs on data that wasn’t used to train it.  If you find that the model performs very poorly you’ve over-fit to the training data.  So you may want to save the N best C and Gamma parameters you found and actually generate N final svm_models and pick the one that performs the best on the validation dataset.

At the very least, having a validation dataset allows you to make sure there’s no regressions in the unit tests of your code.

Next Time

Next time I should be able to walk you through the code easier and just reference concepts introduced in this post.