« Posts tagged Programming

SWIG and a Miss

There are so many pitfalls you’ll encounter after using SWIG for any extended period of time or with a large enough codebase.  I thought I would go over some of the more notable ones I’ve encountered creating a C# wrapper that had me sighing and doubling my caffeine intake.  That said, I do like SWIG.  If you color within the lines it works quite well and has in the end saved me more time than it has cost me.

I’m still debating if it’s worth investing time into my own cleaner/purpose built wrapper generator.  I’d like to hear some thoughts from others who have had to deal with this decision and what you chose.  If you went with writing your own, did you use any helpful libraries (like for parsing C++)?  Also, feel free to add to the this list in the comments if you’ve run into another notable problem you ran into with SWIG.

1) Consistent Generator Configuration

Using different build configurations for code generation is a bad idea.  Even though your generated code will compile using different build configurations, the #defines you set on the swig command line (-DMY_DEFINE) should be consistent in all build configurations.  Varying the generator configuration can lead to things like stale files (unless you rebuild every time), as well as an inconsistent wrapper interface.

2) Automatic Macro Expansion Is Evil

If you have a macro that does different things based upon Debug or Release SWIG will pick the one matching the -D command line define and expand the code to match that.  It won’t just insert your macro as you would if you were writing the code naturally.  The best way I’ve found to prevent this from happening is just block off the macros it’s trying to expand.

C++
#if !defined(SWIG)
    #define MYNEW ...
    #define MYDELETE ...
#endif

3) Const Correctness

C# does not have a way of mimicking C++’s const’ness.  So objects returned in C++ from a function as const Type& are treated as a regular pointer.  Allowing your wrapper user to try and change the underlying data in the object even though it’s suppose to be const.  To avoid this, you should probably remove the function entirely.  Alternatively you can treat it as a new object and generate a new C++ object on the heap and copy it there so that any edits made to it wont be reflected in the actual object.  But that has many negative sides also, like not being able to see changes in the object the reference was for.  You also create a memory leak since SWIG has no idea it has created a new object, so you need to find a way to tell swig you created new memory.  To create a wrapper class for it you can use the %typemap(out) macro like this,

C++
%typemap(out) const Vector3& %{ $result = new Vector3(*$1); %}

4) Return Reference

By default SWIG treats TYPE & as a pointer.  Which is in some sense good and in others horrible.  Like it makes perfect sense for a List class to have a function called Get(int index) that returns the TYPE& to the object in the list.  In a case like that, treating it as a pointer is fantastic.  However, suppose you have the following C++ class,

C++
class Matrix
{
    Matrix();
    Matrix& Identity();
}

Then in C# you do the following,

C++
Matrix4 wrappedMatrix = new Matrix4();
return wrappedMatrix.Identity();

Creating the Matrix4 from C# sets the flag that SWIG needs to delete the underlying native Matrix4 instance when this object goes out of scope.

Next we call Identity() which using the default C# generated code will return a pointer to the same native Matrix4 instance, but a new managed C# wrapper class will be used to wrap it.  This new C# class will have the flag set to NOT clean up the memory since it’s just holding onto a pointer and wasn’t responsible for creating it.

Finally we return the C# object created using the Identity() method, which is a different instance than the one wrappedMatrix points to, even though they both point to the same native class.  Then we return from the scope we are in, wrappedMatrix is garbage collected, the Identity() spawned matrix continues to live on in managed code, but is now pointing at a completely bogus location in memory since wrappedMatrix was responsible for deleting the native object and did so.

So how do you solve this one?  The easiest way is the same as #3, safe bet is just to not have those functions wrapped, instead provide functions that return void.  Ideally, SWIG would provide some way for me to markup the class and say,

C++
SWIG_THIS Matrix& Identity(); // SWIG_THIS doesn't actually exist.

SWIG_THIS would indicate that I’m just returning the same object, so instead of creating a new wrapper C# class when you return from the wrapped Identity() function, just return "this".

5) Customizable Allocation

There’s no standardized way of changing the allocation mechanism SWIG uses.  It will create/destroy arrays using stock malloc and free.  It will create/destroy everything else using new/delete.  The way around this is to modify the SWIG script files it uses to generate the default code. (swig\Lib\*.swg,*.i).  In particular, carrays.i for changing array allocation.  For object creation, you’ll have to modify swigs scripts for your destination language of choice.  For C# you can find them here (swig\Lib\csharp\csharp.swg).

6) Alignment Fail

SWIG and aligned data types don’t get along together.  Presumably you have your own allocator and have either A) Have overridden global new/delete, B) Created a common base class with new/delete functions, or C) Created a macro to use on every class you want to override new/delete on.  If you have then in the case of object creation, you’re safe.  Though you may want to replace SWIGs use of new/delete anyway with your new/delete macros as I am sure you’d like to make sure you track exactly where those allocations came from.  If you have any arrays SWIG creates, you’ll definitely need to replace their usage of malloc and free with one going through your allocator so that you can handle alignment.

7) Nested Types

SWIG doesn’t handle nested data types.  I haven’t found a way around it, you just have to pull them out of the parent class if you want them wrapped.

8) Where Are The Out/Ref Parameter Flags

SWIG doesn’t map method parameters that are passed by (non-const) reference as "out" or "ref" parameters in C#.  The way around this is to do the following,

C++
%include "typemaps.i"
 
%define %TypeOutParam(TYPE)
    %apply TYPE& OUTPUT { TYPE& };
%enddef
 
%TypeOutParam(bool)
%TypeOutParam(int)
%TypeOutParam(float)
...

That will make all instances of those types being used in the context void MyMethod(int& x, int& y) would be mapped to void MyMethod(out int x, out int y) in C#.  Alternatively if I had done,

C++
%include "typemaps.i"
 
%define %TypeRefParam(TYPE)
    %apply TYPE& INOUT { TYPE& };
%enddef
 
%TypeRefParam(bool)
%TypeRefParam(int)
%TypeRefParam(float)
...

The method void MyMethod(int& x, int& y) would be mapped to void MyMethod(ref int x, ref int y) in C#.

9) IntPtr == void*

For some unknown reason SWIG does not map void* to anything useful in C#.  So I came up with the following to map it to the IntPtr struct in C#.

C++
%typemap(ctype)  void * "void *"
%typemap(imtype) void * "IntPtr"
%typemap(cstype) void * "IntPtr"
%typemap(csin)   void * "$csinput"
%typemap(in)     void * %{ $1 = $input; %}
%typemap(out)    void * %{ $result = $1; %}
%typemap(csout)  void * { return $imcall; }

10) string[] –> char**

For C# string[] is not mapped to char**.  So here is how you map it,

C++
%typemap(ctype) char** "char**"
%typemap(imtype) char** "string[]"
%typemap(cstype) char** "string[]"
 
%typemap(csin) char** "$csinput"
%typemap(csout, excode=SWIGEXCODE) char**, const char**& {
    int ret = $imcall;$excode
    return ret;
  }
%typemap(csvarin, excode=SWIGEXCODE2) char** %{
    set {
      $imcall;$excode
    } %}
%typemap(csvarout, excode=SWIGEXCODE2) char** %{
    get {
      int ret = $imcall;$excode
      return ret;
    } %}
 
%typemap(in) char** %{ $1 = $input; %}
%typemap(out) char** %{ $result = $1; %}

11) Operators

Operators are not wrapped by default.  There is probably a way to map them to operators in C#, but I went with another route. You can rename them so that they are, for example

C++
%rename(Add) Vector3::operator +;

12) Memory Lifecycle Management

SWIG has this internal flag in generated classes that tells it if it needs to clean up the memory for the native object when the instance is garbage collected by .Net.  If you have pesky life-cycle management requirements in your system you’ll need to be able to control this flag on the fly.  The way I’ve found to do this is with the %typemap macro,

C++
%typemap(cscode) YOUR_CLASS
%{
    public bool IsLifecycleManaged
    {
        get { return swigCMemOwn; }
        set { swigCMemOwn = value; }
    }
%}

Now you can control on a per instance basis whether or not .Net garbage collecting the object will clean it up.

13) Callbacks

If you need to do callbacks use the "%feature("director")".  The documentation explains it more in depth, but it wasn’t clear to me at first how to do it until I learned about directors.

14) Extension Macro

You can extend/add functionality to the C++ class prior to the wrapper code getting generated using the "%extend" macro.  The extension code is added as a C function though, so you might wonder, how do I access the class that this function is supposed to be part of.  The code SWIG generates refers to the ‘this’ pointer as the variable ‘self’ so you can access anything public on the class by just doing

C++
self->...

15) Unfriendly Template Class Names

SWIG generates unfriendly names for templated classes.  The best way around this is to use the %template macro.  It works like this,

C++
%template(ListFloat) List<float>;

16) Tracking New Memory

If a function creates a new object and returns it, you need to tell SWIG this, otherwise you’ll have a memory leak.  By default SWIG does not destroy any object it wraps unless it created a temporary object in the heap for a value type on the stack.  The way you inform swig a method returns a new object is by doing,

C++
%newobject MyFactory::Create;

17) Casting

As it stands I don’t have a good solution to this one. I thought I should mention it though, because it has been a real thorn in my side. If you have a method in C++ that you wrap and you return a base class like BaseObject*. SWIG will wrap that inside a BaseObject C# class. Now all the other SWIG wrapped classes that derived from BaseObject will exist and will similarly inherit from a C# BaseObject.

However, attempting to cast your new C# BaseObject into anything other than its parent classes will result in an exception. Because SWIG has no way of knowing the true type of the BaseObject* it returned from the function. So the C# object it creates is going to be BaseObject and not the actual C# wrapper class for the native types true type.

SWIG and a Miss @ altdevblogaday.com

Teaching the Toaster to Feel Love [Part 2]

toaster1

Welcome to part two of my series of posts on teaching a toaster to feel love, or introduction to machine learning for the rest of us.

If you missed part one, you can read it here.

Last time I thought in part two I would be able to start posting some code.  However after writing a good chunk of the code I wanted to include in part two, I realized it was way too much to cover in one post.  So I’m going use this post to introduce some function calls and classes, but focus mostly on the process.

Download

A fairly popular SVM library is LibSVM, which you can download on this page here, or directly here.  There are several variations of that library in many languages.  It’s what I will be referencing in this and in coming posts.

Features

The first and most important step when using an SVM is to figure out what constitutes a good feature vector.  If you remember from part one, a feature is just a floating point value with importance to the programmer but is just another number in a dimension for the SVM.  The feature vector is the collection of these features, constituting a single example.

The features that your feature vector is composed of is the real secret sauce, everything else in the process is more or less mechanical.  Sadly it also means I can’t tell you what to put into your feature vector because everyone’s problem needs a different vector.  However, there are lots of tactics that apply in all cases that you should remember.

Normalization

You need to remember to normalize your feature data.  I don’t mean just normalizing the feature vector like you would a float4. I mean, each feature should be normalized either from 0..1 or –1..1 (not both).  For two important reasons, numerical stability and weighting.

Numerical Stability – We don’t want to get into situations where we might be hitting up against the 32bit limit with very large numbers during the testing process.

Weighting – We also don’t want one dimension to greatly out weigh any other dimension accidentally.  Because an SVM is attempting to find the hyperplane with the greatest margin that divides the data, having one dimension in the feature vector run from 0..1000 while everything else runs from 0..1, would mean that almost all your data would be classified based on that one feature because it might find a hyperplane that gave it a margin of 500 in that one dimension while every other feature’s margin added together doesn’t even approach 500.  So that feature more than any of the others would be what determined classification.

Numericalization

Not all kinds of data that you may want to include in a feature vector exists naturally as a number.  Sometimes you need to transform a category into a feature.  Your first thought might be to divide the category number by the number of categories and voilà, a floating point number.  However, this is not the best approach.  The better approach is to represent each category like you would if you wanted to represent them as bit fields, and dedicate one dimension for each category.  So if there were 3 possible categories, you’d represent the first category with 3 features, 0, 0, 1, the second category as 0, 1, 0, and the third as 1, 0, 0.

Rephrasing

It’s important to remember that a machine doesn’t have insight.  If I gave you position and time data, you could tell me other information like velocity and acceleration.  However an SVM doesn’t know what the data is, so it can’t derive any sort of insightful observations about it.  So it’s important to remember to rephrase the same data in different ways that tell the machine new things about the same data.

Ordering

The order of the features does not influence the process, however you need to keep your ordering consistent.

Size

Your feature vectors should all have the same number of features.

Training

The second step in the SVM process is training the SVM model using the feature vectors you’ve built containing all the useful features you think will help distinguish one thing from another thing.

spceaserTo train an SVM you need 2 sets of data; the training set and the validation set.  While you can use just a training dataset, it’s not advisable as it can lead to a problem known as ‘over fitting’.  Which is just a fancy way of saying, you’ve created an SVM that ONLY recognizes the training data and doesn’t handle variation very well.

The most common and easiest training methodology to use for beginners with an SVM is known as a grid search.  The process is pretty simple, to determine the best support vectors we’ll iterate over a series of parameter magic number combinations (Cost (C) and Gamma) until we find a pair that seeds the process of dividing of our data the best.  The kernel type we’ll be using to divide our data will be the Radial Basis Function (RBF) kernel.  You could devise any kind of kernel you wanted to divide the data, however there’s a bunch of existing kernel functions people tend to re-use because they work in a wide number of situations.

The Cost (C) parameter represents the penalty for miss-classification in the training process.  The higher the cost, the more rigid/strict the trained model will be for the training data.  Meaning that while you may see fewer miss classified items (for data that resembles closely the training set), the model will be much more likely ‘over fit’ the problem.

Gamma is a parameter that’s used directly in the RBF kernel function.

The first step in the training process is to create an svm_problem object.  It holds all the training data in the form of svm_nodes.  Then we define our svm_parameter, which represents the kernel type and all the other parameters used in the training and creation process of our svm_model.

To determine the best parameters in the grid search we’ll use the svm_cross_validation method that’s built into SVMLib.  Which allows us to take the ground truth training data and see how well it would be classified given the resulting model from the selected Gamma and C parameters.

Now, you don’t have to vary JUST the C and Gamma parameters you could try different kernel functions.  But the more things you vary in the grid search the longer the training process is going to take.  Which, depending on the size of your dataset may be hours.  It all depends on how many different combinations of parameters you are willing to try.

Side Note: Should you ever run into a training time problem, I highly recommend looking into ways to running the training process in distributed environment.  The training process for an SVM is VERY amenable to being parallelized.  There are also some GPU (CUDA) implementations of the SVM that I’ve read about but not yet tried.  You can always use one SVM implementation for finding the best parameters and another for runtime usage.

Once you’ve determined the best C and Gamma based on the miss-classifications from your svm_cross_validation grid search process.  You’ll just need to call svm_train to generate the final model that you can save out and re-use at runtime to classify datasets that you want the SVM to identify for you.

Once you have your model you’ll want to run your validation dataset against it to see how well it runs on data that wasn’t used to train it.  If you find that the model performs very poorly you’ve over-fit to the training data.  So you may want to save the N best C and Gamma parameters you found and actually generate N final svm_models and pick the one that performs the best on the validation dataset.

At the very least, having a validation dataset allows you to make sure there’s no regressions in the unit tests of your code.

Next Time

Next time I should be able to walk you through the code easier and just reference concepts introduced in this post.

voilà

GeSHi / HLSL

One of the plugins I use for my blog is wp-syntax, which is a pretty fantastic plugin for embedding code in posts.  It uses GeSHi as its syntax highlighter to process the code, but it lacked a scheme for HLSL.  So I hacked out my own and had intended to create a post about it to share with others.  I finally remembered to do that, so feel free to snag it from here.  To use it in a wordpress blog, you just need to extract the php file and place it here “wp-content\plugins\wp-syntax\geshi\geshi” with the other GeSHi syntax php files.

Teaching the Toaster to Feel Love [Part 1]

FarnsworthEvery so often you find yourself confronted with a problem that you as a human can answer, but to express it algorithmically in code would be very difficult.  In these times people often turn to machine learning to solve their problem.  However, machine learning is a enormous field and it’s difficult finding where to even begin.

As new devices are developed to interface with humans for games and other applications; accelerometers, gyroscopes, depth cameras with skeletal tracking and who knows what next.  Interpreting the input reliably has become a much more difficult task.

So I wanted to write a series of posts which hopefully will benefit anyone wanting to learn how to use one variety of machine learning; Support Vector Machines (SVM).

Introduction

An SVM makes a binary decision about a feature vector deciding if it is closer to the positive class or the negative class.  During the training process for an SVM it’s given many feature vectors that exist as examples of what kinds of data appears in the positive vs. negative classifications.

Here is a breakdown of a class,

Class
{
Feature Vector 1 { Feature 1, Feature 2, … },
Feature Vector 2 { Feature 1, Feature 2, … },

}

So what is a feature?  A feature is simply a floating point number that has some special meaning as far as the programmer is concerned.  To the SVM however, it’s just a number in a dimension.  So as you can see the inputs for a class are nothing more than a list of vectors filled with floats.

To determine the best best hyperplanes (support vectors) that best divide the data you have to train the SVM with example feature vectors for each class.  The trick is finding the support vectors that maximize the margin (distance) on either side of the hyperplane.  The easier it is to split the data, the less likely you are to see incorrectly classified data when you put the system to use.

Svm_max_sep_hyperplane_with_margin

So let’s talk examples, if I were to say the following feature vector { 0.25 } was the only feature vector in the positive class, and feature vector { 0.75 } was the only feature vector in the negative class, the ideal support vector would be { 0.50 } because it best divides the data, giving the greatest margin between both classification feature vectors.  Which is easy to see when you have only 1 feature and 1 feature vector in either class.  The problem becomes infinitely more difficult when you have classes of 10,000 feature vectors with 600 features in them each being used for training data.

There are several kernel types used to find the best hyperplane through the features. Linear, Radial Basis Function (RBF), Sigmoid, and Polynomial.  From the papers I’ve read, RBF seems to be quite popular.

Next Time

For the first post I just wanted to introduce the basic vocabulary.  Next time I’ll get into actually using them in practice.

Redneck Cloud Computing

Every now and then I wonder what decisions might be different if I had access to a cloud of machines dedicated to baking game assets or solving other highly parallelizable tasks. So I started looking into what options were available to someone wanting to distribute a ton of work over many machines. I found lots of options but all of them suffered from one or more of these problems,

  • Specialized language
  • Specialized framework (tied to language)
  • Specialized operating system
  • New versions would require manual deployment
  • Non-commercial license

I wanted a solution that didn’t require a specialized framework or language because chances are I would find something I wanted to distribute that I didn’t want to completely rewrite. No specialized OS, I want to be able to slave any unused windows machine in the office (dedicated farms are really expensive). Also if I need to perform new operations or fix a bug, I don’t want to reinstall / deploy new versions to everyone’s machine.

So I decided to roll my own solution and share the design. I’ll share the finished version of the software once I’ve completed it.

Let’s start with a usecase. I have an executable and a pile of data on disk that I would like to distribute over a ton of machines. I was able to quickly modify the existing program to have a command line option that allowed it to process a range of the data on disk instead of all of it.  How do I distribute this work over multiple machines so that I process all the data on disk?

Each time we execute the program with some amount of data to be processed, let’s call that a task. Each task will be processed by some machine in the cloud. To submit the tasks, we will need some common / cross language mechanism of communication. I chose named pipes just because they are easy to use on windows.

To submit tasks to the named pipe, we could either write a simple reusable program that reads task descriptions from a file and submits them to the named pipe, or we could wrap the named pipe communication in a C++ library so that C++ and C# (P-Invoke) could both reuse the logic inside of many various tools (including the generic reusable program that reads task descriptions from files or std::in).

Each machine has a single server running on it. I chose to write the server in C#, but it’s a blackbox as far as your tasks are concerned so you could use something different if you wanted.

Once the server receives the task description via the named pipe it looks at the list of servers it knows about.  The serves are detected using a simple UDP broadcast.

Using .Net remoting I then connect to each one of these machines to see what it can offer me.

Now each of these machines could have who knows what installed on them, and you’re about to transfer and run an exe that may have requirements, like .Net 4.0. So each task needs to contain a list of requirements. For now, I’ve got them scoped to 3 things, defined environment variables, registry key/values and .Net version. You could probably drop the .Net version if you just always keep your server written using the latest version so it’s a prerequisite for any machine on your network.

Now I can try and reserve a slot on the machine, if I fail to reserve a slot because of a race condition I move onto the next server.

Having reserved a slot (some machines that are idle or dedicated may have multiple slots), I need to transfer the executable and all the data files listed inside the task description that are claimed to be needed by the task.

I tell the remote server to then begin running the task in a new thread and I continue looking for empty slots in the cloud to submit tasks to.

After I discover that a task has finished, everything that is different in the remote folder where the task was dumped and subsequently executed is then transferred back to the local machine that submitted the task.

The named pipe that was used to submit one or more tasks remains open the whole time and is notified after each task is finished and the data successfully transferred back to the host machine.

So that’s the design in a nutshell. It’s not a solution designed to solve every distributed computing problem, but I like that it solves a very common pattern of problems I see fairly often in a non-intrusive manner.

I haven’t nailed down yet the best way to prevent the system from being abused by a malicious user.  However, I suspect having a SQL server with the list of MD5 hashes of the executables that are approved for deployment is one idea I’ve been toying with.

Here’s a simple diagram to help explain the task submission process. Because who doesn’t love diagrams.

image