
Creating Kinect Controls for Angry Birds

Before I begin let me give you a little background.  About a month ago we decided to try to create NUI (Natural User Interface), or Kinect, controls for Angry Birds.  We did it partially as a learning experience, but also as a proof of concept that games like Angry Birds (i.e. applications that lend themselves to touchscreen devices) can work in a gestural / depth camera environment better than has been demonstrated.  In the end we tried many of the input methods that are possible with a Kinect, and I wanted to catalog that experience.

I’ll start with the worst input paradigm we tried and go from there.

#5 The Faux Mouse

The Idea
First we just tried mapping the hands to the screen as if they were each a mouse or finger.  Each would control a hand-like cursor and move it around the screen.  Clicks could be performed either by grasping with the hand or by pushing towards the screen.  The game would be played normally, except you’d click on everything using your hand.

The Reality
This input method was by far the worst.  Your hands are simply not mice; they get tired much faster and are not as dexterous.  Even with heavy filtering and snap-able buttons, clicking on a button or object on the screen using your hands as mice is too nuanced and motor skill intensive an operation.  Playing a level took us at least an order of magnitude longer than it takes someone playing on the iPhone.

When pushing towards the screen a lot of care had to be taken to deal with user drift.  When a user pushes towards the screen, they may in fact be pushing towards many possible locations.  They may drift towards the TV, or the camera if that’s their focus.  They may also consciously attempt to maintain a straight hand as they push forward.  No matter the case, a user will drift off their intended target.

When we tried grasp detection using computer vision – again we saw drift.  When a user opens and closes their hand the volume of the hand is changed from the point of view of the camera.  The result ends up displacing the center of mass of the hand causing it to shift downward when the hand is closed.  There are solutions to this problem such as eliminating the fingers from the hand volume calculation but this is a difficult problem given the quality of the data.

In the end the drift issues prevented us from playing as well as we could with a real mouse or on the iPhone.  Angry Birds is a game of quick victory and defeat as well as a game of precision, and with large problems in both precision and the pace of game play, we ditched this idea.

#4 Voice

The Idea
This one is exactly what you might expect – firing the bird using your voice.  Either by saying “Fire!” or maybe… “Ca Cah!”.

The Reality
We really were not sure how this method would perform.  The voice portion was layered onto the mouse control system.  Essentially you would click on the slingshot – move your hand to aim and exclaim viciously to let loose the birds of war.  The hope was that it would alleviate the drift problems, which it did.  Without having to push your hand forward to fire the drift was eliminated from firing.

However there were other problems.  There was a delay in the speech recognition, and often outright failure if you didn’t say the words just right.  We tried several tricks, like using multiple words to identify “Fire”.  One good way to generate that list is to take the top 10 words the recognizer mistook for “fire” and add those to the dictionary as triggers for firing the bird.
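The alias trick can be sketched roughly like this.  It assumes the speech recognizer hands back a plain text hypothesis; `kFireTriggers`, `isFireCommand`, and the alias words themselves are hypothetical placeholders, not the list we actually collected.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical trigger list: "fire" plus words a recognizer might return
// when someone says "fire". The aliases are made-up examples.
static const std::unordered_set<std::string> kFireTriggers = {
    "fire", "fryer", "wire", "five", "hire"
};

// Returns true when the recognized word should launch the bird.
bool isFireCommand(std::string recognized)
{
    // Normalize case so "Fire" and "FIRE" match the same entry.
    std::transform(recognized.begin(), recognized.end(), recognized.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return kFireTriggers.count(recognized) > 0;
}
```

Widening the trigger list trades missed commands for more accidental firings, so the aliases are worth checking against words that come up in normal conversation around the game.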

Moving Away From Faux Mouse

After the failures with both #4 and #5 we went back to the drawing board.  We needed to get away from the mouse or touch device centered thinking.  The worlds are simply too different to treat them the same.  So we prototyped 3 other solutions that all ended up being better than the mouse style interface.

The ideas all stem from an understanding that when you go to port an application from a touch enabled world you need to think primarily about 2 things:

  • Context
  • Automation

Context – How can you reduce and scope the options so that a broad array of them can be presented to the user, but with only a few usable at any given time through a small vocabulary of motions?

An example from Angry Birds is all the options the user can perform in the game:

  • Fire Bird
  • Activate Bird Special Attack
  • Restart
  • Pan
  • Zoom
  • Return to the Menu

We needed to find a way to contextualize these options when moving away from a mouse driven style interface.

Automation – Find the items in the application that everyone does without thinking about it and automate them.  If they aren’t relevant to game play find a way to make them irrelevant in a NUI application.

An example from Angry Birds is activating the slingshot.  You probably don’t think about it when playing the game but to fire a bird you must first place your finger on the slingshot before drawing back the bird to fire it.  While this is unbelievably trivial to the point of not thinking about it on an iPhone, it’s a huge pain in an environment where you have to get a virtual hand cursor over it, even more so if you then need to push forward to activate it.

So we needed to find a way to automate clicking the slingshot.  Instead of being clicked explicitly, the slingshot would be activated implicitly by some gesture that begins the act of firing a bird, and that gesture would be disconnected from where the hand is on screen relative to the slingshot.

#3 Arclight

The Idea
You would draw back the slingshot by bringing your hands together.  Then bring your hands apart and rotate them around your center of mass to change the firing angle on the slingshot.  Then once you’ve settled on the firing angle, bring your hands together fast to trigger the fire.

The Reality
The problem with this kind of activation of the slingshot was the drift when the hands come together.  This can be partially accounted for but it’s heuristically based and can be erroneous.  Additionally activating the bird’s special power was difficult.  You would have to choose a different kind of interaction to activate the special power which would complicate the process.

#2 Stretch and Snap

The Idea
This idea grew out of the Arclight firing system.  To attempt to solve the problem of drift, have the slingshot fire as soon as the arms reach a certain distance apart.

The Reality
With this firing system you still have the problem of determining how to activate the bird special power.  You also introduce a new problem: all birds are fired at maximum drawback.  You also need to make sure to provide the user with feedback so that they know how close they are to the *snap*, such as some kind of firing progress bar.
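A minimal sketch of the snap test and the progress value that could drive such a bar.  The `Vec3` type, the function names, and the `snapDistance` tuning value are assumptions made for illustration; they are not taken from our prototype.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Distance between the two hand joints, in the same units as the skeleton data.
float handSeparation(const Vec3& left, const Vec3& right)
{
    const float dx = right.x - left.x;
    const float dy = right.y - left.y;
    const float dz = right.z - left.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Progress toward the snap in [0, 1], suitable for driving a progress bar.
// snapDistance is a hypothetical tuning value and must be greater than zero.
float snapProgress(float separation, float snapDistance)
{
    return std::max(0.0f, std::min(1.0f, separation / snapDistance));
}

// The slingshot fires as soon as the hands are at least snapDistance apart.
bool shouldFire(float separation, float snapDistance)
{
    return separation >= snapDistance;
}
```

Since arm spans differ between users, `snapDistance` would probably need to scale with the user's skeleton rather than being a fixed number.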

#1 Axis Separated

The Idea
For this idea I separated the functions of the hands into distinct responsibilities.  Your left hand activates the slingshot by pushing forward (it doesn’t matter where).  Once a threshold is crossed the slingshot is activated, and from then on the slingshot firing angle is calculated from the left hand’s location relative to the shoulder.  To fire the bird you push the right hand forward and pull it back, which sends the bird flying.  To activate the bird’s special power you again push the right hand forward.
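The two measurements involved can be sketched as follows.  This assumes a coordinate system where z is the distance from the sensor (so pushing forward lowers z) and that the angle is taken in the x/y plane; the names and the threshold parameter are hypothetical, not the values we shipped.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// True once the hand is far enough in front of the shoulder.
// Assumes z is the distance from the sensor, so pushing forward lowers z.
bool isPushedForward(const Vec3& shoulder, const Vec3& hand, float pushThreshold)
{
    return (shoulder.z - hand.z) >= pushThreshold;
}

// Firing angle in degrees, from the hand's offset relative to the shoulder
// projected onto the x/y plane. 0 points along +x and 90 points straight up.
float firingAngleDegrees(const Vec3& shoulder, const Vec3& hand)
{
    const float kRadToDeg = 57.29577951f;
    return std::atan2(hand.y - shoulder.y, hand.x - shoulder.x) * kRadToDeg;
}
```

Because the push test only looks at depth relative to the shoulder, it does not matter where on the screen the hand happens to be, which is what removes the drift problem.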

The Reality
This method ended up working perfectly.  It doesn’t result in any drift when firing is activated. It is also easy to perform because all the motions can be performed with your arms down by your sides, reducing exhaustion in long game play sessions.

Lessons Learned

You hear it all the time, but it is critically important to prototype ideas when it comes to creating Kinect controls.  Ideas simply don’t work as well in reality as they do in your head.  Here’s a demonstration video of the end result:

[Embedded YouTube video]

Kinect, Anthropometry and You

Anthropometry (Greek anthropos (άνθρωπος – "man") and metron (μέτρον – "measure") therefore "measurement of man") refers to the measurement of the human individual.

Ask any Kinect developer what the hardest problem is in developing a game or application that uses the Kinect, or any other depth camera.  The answer you’ll get most often will be creating something that works well for 95% of target users.

This is something you have to consider for all your gestures, poses, and UI interaction.

  • Is this gesture too difficult for your average user?
  • Does this pose require too much flexibility?
  • Is this UI interaction comfortable and easy to perform?

One area that can benefit from anthropometry data is UI interaction.  When you think about UI interaction with Kinect you’ve got to picture a real world space (box, sphere, cylinder, other) located somewhere around the body, with the hand’s position inside that space being mapped to the 2D or 3D UI coordinate plane.

When determining the size and location of this real world space and how it maps user hand locations onto the UI coordinate system the largest question you need to consider is:

Where will they be most comfortable?

Generally speaking you want a space that minimizes upper arm movement – as that is much more strenuous compared to forearm movements.

However, since the 3D position we’re mapping into our 2D/3D UI coordinate system varies with the size of the user’s skeleton, we can’t choose a single set of real world dimensions that will work for everyone.  We’ll have to make educated guesses about the size and location of our real world UI coordinate frame based on the size and location of the user’s bones.

So how does anthropometry fit in?

Because the skeleton you get from Kinect and other SDKs can be unreliable in certain poses, you often find yourself heavily filtering any kind of data you’re tracking about the user, especially things like the user’s arm length, which can vary dramatically over a session.

So one thing I prefer to do is use anthropometry tables to get a size and location that is more consistent and doesn’t fluctuate as much as the user’s skeleton.  Using anthropometry tables we can estimate the user’s arm length or hand size based on other bones in their body, bones that are more stable in your skeleton SDK of choice (Kinect, OpenNI, Iisu, Omek…etc).

You can also use anthropometry tables to estimate the size of body parts that the skeleton SDK you’re using doesn’t provide, such as the size of the user’s hand.

But where do you find that kind of anthropometry data?

Luckily such a resource has already been painstakingly cataloged for us by the FAA – The Human Factors Design Guide.  The HFDG was put together so that planes could be constructed so that almost anyone would fit and be able to operate anything from their seat.

The anthropometry data that’s valuable to us starts in chapter 14, page 791.  For example, these lovely tables from page 818 show the functional reach and the extended functional reach of men and women broken down by population percentiles.

[Image: HFDG functional reach and extended functional reach tables]

Trip Report: Gamefest 2011 – Seattle

I managed to make it out to Seattle this year for Gamefest and figured I’d share my thoughts on some of the different presentations I saw.  They are not available yet, but it looks like Microsoft is going to be posting the slides/audio for the different presentations here soon.

Tiled Resources for Xbox 360 and Direct3D 11 – Matt Lee

This talk was about mega-texturing in DirectX 11/Xbox 360.  Matt Lee was showing a new DirectX SDK sample that’s coming in the next SDK release giving a reference implementation of a mega-texturing run-time.  I’ve only skimmed mega-texturing papers so I got a lot out of this talk since he walked through all the steps in the run-time.

The sample shows off how you begin by creating different tile pools for different resource formats.  Each pool is dedicated to a different texture format.  The tiles within a pool are all the same size; however, the tile size may vary from one texture format to another to maximize cache efficiency.  When you render the scene you have a shader that can write out texture look-up failures.  When the tile for a given set of UV coordinates and mip level is not resident in memory, a failure is added to this list.  After the shader completes you read back the failures and proceed to load the tiles that will fit in your established pools.
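The read-back step can be sketched on the CPU side like this.  It is my own simplification, not code from the sample: `TileRequest`, `TilePool`, and `processFailures` are hypothetical names, there is no real GPU upload, and eviction of old tiles is left out entirely.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <tuple>
#include <vector>

// One failed look-up recorded by the shader: which tile of which mip was missing.
struct TileRequest {
    std::uint32_t textureId;
    std::uint32_t mipLevel;
    std::uint32_t tileX;
    std::uint32_t tileY;

    bool operator<(const TileRequest& o) const {
        return std::tie(textureId, mipLevel, tileX, tileY)
             < std::tie(o.textureId, o.mipLevel, o.tileX, o.tileY);
    }
};

// Fixed-capacity pool for one texture format. Its capacity is set once and
// never changes, which is what keeps streaming from fragmenting memory.
class TilePool {
public:
    explicit TilePool(std::size_t capacity) : capacity_(capacity) {}

    bool isResident(const TileRequest& r) const { return resident_.count(r) > 0; }

    // Process the failures read back after a frame. Loads as many missing
    // tiles as still fit and returns how many were loaded.
    std::size_t processFailures(const std::vector<TileRequest>& failures)
    {
        std::size_t loaded = 0;
        for (const TileRequest& r : failures) {
            if (resident_.size() >= capacity_) break;  // pool is full
            if (resident_.insert(r).second) {
                // A real run-time would copy the tile's texels into the pool here.
                ++loaded;
            }
        }
        return loaded;
    }

private:
    std::size_t capacity_;
    std::set<TileRequest> resident_;
};
```

A real run-time also has to decide which resident tiles to throw out once a pool fills up; this sketch simply stops loading.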

Unlike most texture streaming systems you’re not loading an entire mip level or the entire mip chain of the texture.  You’re only ever loading into the tiles a sub-region of a texture (like a 64×64 pixel region), which overcomes one common texture streaming problem, texture memory fragmentation.  Because the tile pools you create are never deallocated you don’t have to worry about fragmenting your texture memory because of different sized textures being streamed in and out.

Now the sample is not without its shortcomings, but that is mostly due to hardware limitations.  Ideally the virtual texture system would be transparent and you wouldn’t need to write a shader that records look-up failures.  The GPU and DirectX would simply report when a failure occurred and allow you to handle it.  Maybe some day…

Gesture Detection Using Machine Learning – Claude Marais

If you have ever been interested in machine learning this is a worthwhile presentation to check out when the slides are posted.  Claude Marais talked about a case study they performed to try to use machine learning to detect a Punch and a Kick.  For their experiment they used AdaBoost, which is a machine learning technique that combines thousands of weak classifiers that ‘boost’ each other and provide you with a high degree of accuracy in the results.

The classifiers are all extremely simple things, for example you may have a classifier like:

C++
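// Weak classifier: vote +1 if the elbow angle exceeds ANGLE, otherwise -1.
int classify(float elbow_joint_angle) {
    if (elbow_joint_angle > ANGLE)
        return 1;
    return -1;
}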

Then simply create a macro and have 180 variants of this classifier, one for each ANGLE.  If you can imagine all the different things you could measure about the skeleton, creating simple variants of the kernels for each of the possible test cases will explode the number of weak classifiers you have; Claude had around 21,000 weak classifiers for his system.

The training phase looks at labeled data sets to know what examples of punches look like (positive examples) and what -not- punches look like (negative examples).  It uses the +1/-1 scores each weak classifier provides to determine the weights to apply to each classifier.  After it has determined the best weak classifiers to detect a punch, and to not detect a negative example as a punch by accident, you can use the classifiers at run-time with the weights applied to detect a punch.
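To make the run-time side concrete, here is a small sketch of one family of weak classifiers and the weighted vote.  It is not Claude's code: the names are hypothetical, a loop stands in for the macro, the training step that produces the weights is not shown, and only a single measurement (elbow angle) is used.

```cpp
#include <cstddef>
#include <vector>

// One weak classifier: compares a single elbow angle against a threshold.
struct AngleClassifier {
    float thresholdDegrees;
    float weight;  // set by training; 0 means the classifier is unused

    int vote(float elbowAngleDegrees) const {
        return elbowAngleDegrees > thresholdDegrees ? 1 : -1;
    }
};

// Build one classifier per whole degree (0..179), all with zero weight.
std::vector<AngleClassifier> makeAngleClassifiers()
{
    std::vector<AngleClassifier> classifiers;
    classifiers.reserve(180);
    for (int angle = 0; angle < 180; ++angle)
        classifiers.push_back({ static_cast<float>(angle), 0.0f });
    return classifiers;
}

// Strong classifier: the sign of the weighted sum of the weak votes.
bool detectsPunch(const std::vector<AngleClassifier>& classifiers,
                  float elbowAngleDegrees)
{
    float score = 0.0f;
    for (const AngleClassifier& c : classifiers)
        score += c.weight * static_cast<float>(c.vote(elbowAngleDegrees));
    return score > 0.0f;
}
```

With thousands of classifiers most weights end up at or near zero, so a real detector would likely keep only the handful that training found useful.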

The results were undeniable; they had a demo set up in the expo area that was really good at detecting a punch and a kick.

The only real drawback to this solution is the data collection.  If my memory of their chart is correct, they needed something on the order of 70,000 examples of punches, and 7x that in negative (not-a-punch) examples, before the training produced accuracies over 90-95%.

In training the system they had 70,000 frames’ worth of recorded training data.  The actual number of recorded punches used to train the system was 25 different people doing 10 punches each, so around 250 punch examples.  Then they had about 7x that number in negative training examples, which might be things like waves, or other actions that the training can use to differentiate between random movement and an intentional punch. (Thanks to Claude for clarifying this.)

Kinect and Kids: Pitfalls and Pleasantries – Deborah Hendersen

If you had asked me to make a Kinect game for kids (ages 3-6) before seeing this presentation I likely would’ve designed something with a dumb-me as the target audience.  What I quickly realized is how wrong I would’ve been to make that assumption.  At that stage of development kids are not capable of interacting with games I’m used to playing.

Something as simple as a menu of options is an impossibility since they are illiterate.  How many games have you seen that you could play without knowing how to read?

When interacting with an onscreen character, the kids ignore social norms of waiting for the person to finish talking.  They may just jump the gun if they already know what is expected and get frustrated if they can’t do it when they want to.

Kids are distracted very easily and will make their own games out of game behavior.  Deborah mentioned one story where a kid stopped playing the game because he realized that when he left the play area and Kinect could no longer detect him, the game would do something.  So he made up his own game of jumping in and out of the play area to activate this condition; utterly boring for adults, completely entertaining for this kid.

You almost have to design the game as a passive experience, like a children’s TV show.  On TV there is no feedback, so the host asks the kid, “Can you find _______?”, and the show simply pauses, expecting the kid at home to say something, before carrying on.  The game has to function in essentially the same way: regardless of whether the kid participates in the expected fashion, the game has to move forward.  If it functions like a state machine that requires proper actions to move forward, the kid may become bored and simply want to move on.  If the game refuses to let them move on, they’ll just walk away.

I really enjoyed this presentation because it made very clear how difficult the problem space is, and it was interesting to hear how they tried to solve each problem.

Kinect Hands: Finger Tracking and Voxel UI - Abdulwajid Mohamed and Tony Ambrus

This presentation was broken into two completely different parts; the first part was on finger tracking with Kinect.  This is one area I’ve been playing around in for a while, so it was interesting to see someone else’s attempt to solve the problem.  Because the Kinect is a structured light depth camera you don’t necessarily have depth at each pixel like you would on a time of flight depth camera.  A structured light camera builds a topology of depth using the light pattern it projects into the scene; viewing that pattern from a different angle lets it discern depth, but a single dot does not give you depth.  It connects groups of dots when determining the depth of a surface.  This means that even though your hand can be seen by Kinect, the further you back away from the sensor, the more like a mitten your hand becomes.  The gaps between your fingers disappear until they are just clumps on your wrist.

Because of this limitation you can’t go past 10 feet, there simply isn’t enough data.  Ideally the user is at 6 feet or closer, past 6 feet the accuracy begins to break down.

The way Microsoft tackled the problem was to first capture lots of hand examples and then to train an SVM (Support Vector Machine) against a curvature analysis of the hands.  So once you know all the pixels that make up a person’s hand you find the points on the hand that result in the largest changes in curvature.  On an open hand these curves are your fingers, and if you’re close enough to the camera that it can see the gaps between fingers it’s a very large change in curvature.  A closed hand has more or less a uniform curvature change viewed from any angle.  By training the SVM against a set of closed hand curvature examples vs. open hand curvature examples they were able to get pretty accurate results at about the 6-8 foot range for an adult, 5-7 feet for kids.

Because the detector is instantaneous, i.e. it can tell you in a single frame whether the hand is open or closed, you need something to counteract a single misinterpreted frame or a couple of them.  So they trained an HMM (Hidden Markov Model) on examples of a flaky transition where the system is quickly switching between 2 states because the hand is at an odd orientation confusing the SVM; I thought it was an interesting solution to the problem.  I’ve only ever tried something simple like requiring 3 contiguous frames of agreement to have a state change.
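For comparison, the simple approach I mentioned (not the HMM from the talk) can be sketched like this.  The class and enum names are made up for the example, and the per-frame open/closed result is assumed to come from whatever detector you already have.

```cpp
enum class HandState { Open, Closed };

// Accepts a state change only after `requiredFrames` consecutive frames agree.
class HandStateFilter {
public:
    explicit HandStateFilter(int requiredFrames = 3,
                             HandState initial = HandState::Open)
        : requiredFrames_(requiredFrames),
          stable_(initial), candidate_(initial), streak_(0) {}

    // Feed the raw per-frame result; returns the filtered state.
    HandState update(HandState observed)
    {
        if (observed == stable_) {
            streak_ = 0;  // agreement with the current state resets the count
        } else {
            if (observed == candidate_) {
                ++streak_;
            } else {
                candidate_ = observed;
                streak_ = 1;
            }
            if (streak_ >= requiredFrames_) {
                stable_ = observed;
                streak_ = 0;
            }
        }
        return stable_;
    }

private:
    int requiredFrames_;
    HandState stable_;
    HandState candidate_;
    int streak_;
};
```

The cost of this filter is latency: at 30 frames per second, requiring 3 frames delays every real state change by roughly a tenth of a second.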

The second half of the presentation was on a 3D (not stereoscopic) UI for Kinect.  One of the problems with navigating a ‘push to click’ interface is that it’s hard to correct for user drift.  When a user pushes forward they may do one of several things:

  • Push toward the TV
  • Push toward the sensor
  • Push forward (wherever forward happens to be at that moment in time)

Depending upon what you’re expecting them to do there’s going to be drift away from the thing on the screen they are trying to click.  To attempt to correct this Abdulwajid presented a UI where the hands are visualized as voxelized clumps of boxes in a 3D environment with 3D buttons that could be mashed.  Seeing the hand in the same space as the button appeared to make it much easier to perform the click.

One thing I noticed that was not called out was his use of 2 directional shadow casting lights.  By having 2 directional lights facing each other, both casting shadows, the resulting effect is a focal point.  As the hand gets closer to a surface the eye perceives the two shadows heading towards each other and can see the point where they will meet.  I thought that was an additional powerful indicator of where your hand was moving in the space, and it made it much easier to correct drift.

Project Photofly Experiment

Last week we were sitting around the office wondering if it would be possible to place ourselves in a game world with Autodesk’s Project Photofly.  How cool would that be?  We thought we might be able to scan one of us in a T-pose and then use Mixamo’s Auto-Rigging tool to create a rigged avatar.  Then we could be running around a level in front of our Kinect as ourselves.

Sadly it never went past stage one.  It’s harder than you might think to hold a T-pose for 3 minutes while someone circles you twice snapping pictures at 10 degree intervals.

I don’t have any pictures of the results; I came out looking like the elephant man.  We’ll probably try again at some point, but in the meantime I made another scan of a pair of static objects that was turning out pretty good until I got to the back of the monkey.

[Embedded YouTube video]

All in all Autodesk’s Photofly software is pretty cool.  It’s still lacking in the area of iteration and debugging.  You can try manually tagging photo matchup points between images to give it a better idea of how the images fit together, but it takes a while for the data to be processed in the cloud.  It’s also unclear where some data comes from, or why portions of the background become part of the foreground mesh.  If it had better feedback for how that data became part of the mesh, cleaning up the results would be a lot easier.

Also, if you own a camera with a sports video mode that captures at 60 FPS you can just slowly circle the subject and then dump all the frames using ImageGrab, which is way easier than snapping individual pictures.

I wonder if I could generate 3d art for a game jam…

Intelligent Character Motion 1.1 – With Unity Integration

Back in April I wrote a short post about Activate3D releasing 1.0 of ICM, but unlike the 0.8 alpha version the 1.0 did not ship with a community version.

Well today I’m happy to say we’ve released a new 1.1 version with a community version along with a Unity (Free or Pro) integration.

The 0.8 version of the product was much harder to pick up and play with because there wasn’t a level editor.  With the Unity integration that problem has been greatly alleviated.  Users can now drag and drop features into their level to create a world they can explore and interact with using their Kinect and OpenNI.

[Embedded YouTube video]

We’ve tried to expose a lot of the functionality to the GUI layer in Unity.  For the things you’re unable to do through the GUI, we’ve exposed a great deal of our API to .Net.  The 1.1 community edition also includes our native and managed binding layer code so that if you need to expose additional things or need to do something only available in our native C++ API you can take advantage of our existing SWIG code to wrap your new functionality, instead of writing your own wrapper layer or SWIG interface from scratch.

Download ICM 1.1 Community Edition

The new 1.1 version of ICM solves many of the problems OpenNI users encounter and more.  All of these ICM features are usable out of the box with a couple of mouse clicks:

  • Skeleton retargeting
  • Skeleton stabilization
  • Gesture/Pose detection
  • Several grasping solutions
  • Refined hand position for NUI GUIs
  • Engagement/Disengagement with physical objects
  • Physical object collision
  • Feet planting
  • Avatar physical simulation
  • Sample interactable features
    • Bars/Poles
    • Ropes
    • Floors/Walls
    • Water
    • Ledges
    • Dynamic Box/Sphere
    • Jump Paths
    • Triggers
    • Zipline