
Creating Kinect Controls for Angry Birds

Before I begin, let me give you a little background.  About a month ago we decided to try to create NUI (Natural User Interface) controls, specifically Kinect controls, for Angry Birds.  This was partly a learning experience, but also a proof of concept that games like Angry Birds (i.e. applications that lend themselves to touchscreen devices) can work in a gestural / depth camera environment better than has been demonstrated so far.  In the end we tried many different input methods that are possible with a Kinect, and I wanted to catalog that experience.

I’ll start with the worst input paradigm we tried and go from there.

#5 The Faux Mouse

The Idea
First we just tried mapping the hands to the screen as if each were a mouse or finger.  Each hand would control a hand-like cursor and move it around the screen.  Clicks could be performed either by grasping with the hand or by pushing towards the screen.  The game would be played normally – except you’d click on everything using your hand.

The Reality
This input method was by far the worst.  Your hands are simply not mice: they tire much faster and are not as dexterous.  Even with heavy filtering and snap-able buttons, clicking a button or object on the screen with your hand as a mouse is just too nuanced and motor-skill intensive.  Playing a level this way took us at least an order of magnitude longer than playing it on the iPhone.

When pushing towards the screen, a lot of care had to be taken to deal with user drift.  When users push towards the screen, they may in fact be pushing towards one of many possible locations.  They may drift towards the TV, or towards the camera if that’s their focus.  They may also consciously attempt to keep their hand straight as they push forward.  Whatever the case, the user’s hand will drift off the intended target.

When we tried grasp detection using computer vision – again we saw drift.  When a user opens and closes their hand the volume of the hand is changed from the point of view of the camera.  The result ends up displacing the center of mass of the hand causing it to shift downward when the hand is closed.  There are solutions to this problem such as eliminating the fingers from the hand volume calculation but this is a difficult problem given the quality of the data.

In the end the drift issues prevented us from playing as well as we could with a real mouse or on the iPhone.  Angry Birds is a game of quick victory and defeat as well as a game of precision, so with large problems in both precision and game play speed, we ditched this idea.

#4 Voice

The Idea
This one is exactly what you might expect – firing the bird using your voice.  Either by saying “Fire!” or maybe… “Ca Cah!”.

The Reality
We really were not sure how this method would perform.  The voice portion was layered onto the mouse control system.  Essentially you would click on the slingshot – move your hand to aim and exclaim viciously to let loose the birds of war.  The hope was that it would alleviate the drift problems, which it did.  Without having to push your hand forward to fire the drift was eliminated from firing.

However, there were other problems.  There was a delay in the speech recognition, and often outright failure if you didn’t say the words just right.  We tried several tricks, like accepting multiple words as “Fire”.  One good way to generate that list is to take the top 10 words the recognizer mistook for “fire” and add them to the dictionary as triggers for firing the bird.
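
As a rough sketch of that last trick, assume you have logged what the recognizer returned each time a tester said “Fire”.  The log format and the names `build_trigger_words` and `should_fire` are hypothetical and not part of any speech SDK.

```python
from collections import Counter

def build_trigger_words(recognized_log, intended="fire", top_n=10):
    """Return the intended word plus the top_n words most often heard instead."""
    # Count only the misrecognitions; correct hits need no alias.
    misheard = Counter(
        word.lower() for word in recognized_log if word.lower() != intended
    )
    aliases = [word for word, _ in misheard.most_common(top_n)]
    return {intended, *aliases}

# Hypothetical recognizer output collected while testers said "Fire".
log = ["fire", "far", "fire", "fur", "far", "five", "far", "five", "higher"]
triggers = build_trigger_words(log, top_n=2)

def should_fire(recognized_word):
    """True when the recognized word counts as a fire command."""
    return recognized_word.lower() in triggers
```

Widening the dictionary this way trades missed commands for accidental ones, so the cutoff is worth tuning with real testers.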

Moving Away From Faux Mouse

After the failures with both #4 and #5 we went back to the drawing board.  We needed to get away from mouse or touch device centered thinking.  The worlds are simply too different to treat them the same.  So we prototyped 3 other solutions, all of which ended up being better than the mouse style interface.

The ideas all stem from an understanding that when you port an application from a touch enabled world, you need to think about two things primarily:

  • Context
  • Automation

Context – How can you reduce and scope the options so that a broad array of them can be offered to the user, but with only a few usable at any given time through a small vocabulary of motions?

An example from Angry Birds is all the options the user can perform in the game:

  • Fire Bird
  • Activate Bird Special Attack
  • Restart
  • Pan
  • Zoom
  • Return to the Menu

We needed to find a way to contextualize these options when moving away from a mouse driven style interface.

Automation – Find the items in the application that everyone does without thinking about it and automate them.  If they aren’t relevant to game play find a way to make them irrelevant in a NUI application.

An example from Angry Birds is activating the slingshot.  You probably don’t think about it when playing the game but to fire a bird you must first place your finger on the slingshot before drawing back the bird to fire it.  While this is unbelievably trivial to the point of not thinking about it on an iPhone, it’s a huge pain in an environment where you have to get a virtual hand cursor over it, even more so if you then need to push forward to activate it.

So we needed to find a way to automate clicking the slingshot.  Instead of being clicked explicitly, it would be activated implicitly by some gesture that begins the act of firing a bird, regardless of where the hand is on screen relative to the slingshot.

#3 Arclight

The Idea
You would draw back the slingshot by bringing your hands together.  Then bring your hands apart and rotate them around your center of mass to change the firing angle on the slingshot.  Then once you’ve settled on the firing angle, bring your hands together fast to trigger the fire.

The Reality
The problem with this kind of activation of the slingshot was the drift when the hands come together.  This can be partially accounted for but it’s heuristically based and can be erroneous.  Additionally activating the bird’s special power was difficult.  You would have to choose a different kind of interaction to activate the special power which would complicate the process.

#2 Stretch and Snap

The Idea
This idea grew out of the Arclight firing system.  To attempt to solve the problem of drift, have the slingshot fire as soon as the arms reach a certain distance apart.

The Reality
With this firing system you still have the problem of determining how to activate the bird’s special power.  You also introduce a new problem: all birds are fired at maximum drawback.  And you need to give users feedback, such as a firing progress bar, so that they know how close they are to the *snap*.
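
A minimal sketch of the snap threshold and the progress feedback, assuming hand positions arrive as (x, y, z) tuples in meters.  The `rest` and `snap` distances are made-up values that would need tuning per user.

```python
import math

def hand_distance(left, right):
    """Euclidean distance between two (x, y, z) hand positions."""
    return math.dist(left, right)

def snap_progress(distance, rest=0.2, snap=0.9):
    """Map hand separation to 0.0..1.0 for a progress bar; 1.0 means fire."""
    fraction = (distance - rest) / (snap - rest)
    return max(0.0, min(1.0, fraction))

def update(left, right):
    """Return (progress, fired) for one frame of hand positions."""
    progress = snap_progress(hand_distance(left, right))
    return progress, progress >= 1.0
```

Drawing the bar from `progress` every frame lets the user see the snap coming instead of being surprised by it.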

#1 Axis Separated

The Idea
For this idea I separated the functions of the hands into distinct responsibilities.  Your left hand activates the slingshot by pushing forward (it doesn’t matter where).  Once a threshold is crossed the slingshot is activated.  From then on, the slingshot firing angle is calculated from the left hand’s location relative to the shoulder.  To fire the bird, the right hand is pushed forward and pulled back; this sends the bird flying.  To activate the bird’s special power you push the right hand forward again.

The Reality
This method ended up working perfectly.  It doesn’t result in any drift when firing is activated. It is also easy to perform because all the motions can be performed with your arms down by your sides, reducing exhaustion in long game play sessions.
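
The Axis Separated scheme can be sketched as a small state machine.  This is an illustration rather than our shipping code: the coordinate convention (z grows away from the camera), the 0.35 m push threshold and all names are assumptions.

```python
import math

PUSH_THRESHOLD = 0.35  # assumed: meters the hand must be in front of the shoulder

def is_pushed(shoulder, hand, threshold=PUSH_THRESHOLD):
    """True when the hand is at least `threshold` closer to the camera than the shoulder."""
    return shoulder[2] - hand[2] >= threshold

def firing_angle(shoulder, hand):
    """Angle in degrees of the shoulder-to-hand vector in the x/y plane."""
    return math.degrees(math.atan2(hand[1] - shoulder[1], hand[0] - shoulder[0]))

class AxisSeparatedControls:
    def __init__(self):
        self.armed = False             # slingshot activated by the left hand
        self.bird_in_flight = False    # a bird was fired, special not yet used
        self.right_was_pushed = False  # right hand state on the previous frame

    def update(self, l_shoulder, l_hand, r_shoulder, r_hand):
        """Process one skeleton frame and return (event, angle)."""
        event = None
        if not self.armed and not self.bird_in_flight and is_pushed(l_shoulder, l_hand):
            self.armed = True
            event = "armed"
        # The angle only depends on the left arm, never on a screen cursor.
        angle = firing_angle(l_shoulder, l_hand) if self.armed else None

        right_pushed = is_pushed(r_shoulder, r_hand)
        if self.armed and self.right_was_pushed and not right_pushed:
            # Right hand was pushed forward and has now been pulled back.
            self.armed = False
            self.bird_in_flight = True
            event = "fire"
        elif self.bird_in_flight and right_pushed and not self.right_was_pushed:
            # A second push while the bird is flying triggers its special power.
            self.bird_in_flight = False
            event = "special"
        self.right_was_pushed = right_pushed
        return event, angle
```

Because firing is triggered by the right hand while the angle comes from the left, the push never disturbs the aim.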

Lessons Learned

You hear it all the time, but it is critically important to prototype ideas when creating Kinect controls.  They simply don’t work as well in reality as they do in your head.  Here’s a demonstration video of the end result:

YouTube Preview Image

Intelligent Character Motion 1.1 – With Unity Integration

Back in April I wrote a short post about Activate3D releasing 1.0 of ICM, but unlike the 0.8 alpha version the 1.0 did not ship with a community version.

Well today I’m happy to say we’ve released a new 1.1 version with a community version along with a Unity (Free or Pro) integration.

The 0.8 version of the product was much harder to pick up and play with because there wasn’t a level editor.  With the Unity integration that problem has been greatly alleviated.  Users can now drag and drop features into their level to create a world they can explore and interact with using their Kinect and OpenNI.

We’ve tried to expose a lot of the functionality to the GUI layer in Unity.  For the things you’re unable to do through the GUI, we’ve exposed a great deal of our API to .NET.  The 1.1 community edition also includes our native and managed binding layer code.  If you need to expose additional things, or need to do something only available in our native C++ API, you can use our existing SWIG code to wrap your new functionality instead of writing your own wrapper layer or SWIG interface from scratch.

Download ICM 1.1 Community Edition

The new 1.1 version of ICM solves many of the problems OpenNI users encounter and more.  All of these ICM features are usable out of the box with a couple of mouse clicks:

  • Skeleton retargeting
  • Skeleton stabilization
  • Gesture/Pose detection
  • Several grasping solutions
  • Refined hand position for NUI GUIs
  • Engagement/Disengagement with physical objects
  • Physical object collision
  • Feet planting
  • Avatar physical simulation
  • Sample interactable features
    • Bars/Poles
    • Ropes
    • Floors/Walls
    • Water
    • Ledges
    • Dynamic Box/Sphere
    • Jump Paths
    • Triggers
    • Zipline

Do the Truffle Shuffle to Start

Preface

The first time I stepped in front of a depth camera was almost a year ago now.  We had a reference version of a PrimeSense camera that is heavily related to the final hardware that went into Kinect.  The first thing I got to do was make a stick figure guy move around on the screen.  It was very captivating to see him match my movements, even with the occasional arm through my chest Kali-Ma style.

Those first days were filled with lots of experimentation because everything was new to us in this world of full body motion gaming.  Which reminds me…if you ever want to see a cool effect, go grab a large mirror and hold it in front of a depth camera at an angle; now you’re really playing with portals!

Introduction

With all the time I’ve spent around these cameras, I wanted to capture some thoughts on the problems of developing games and software driven by full body motion input.

Unnatural User Input

Lately I’ve come to find the statement “Natural User Input” a bit of a misnomer.  There are still many technological and human hurdles that have to be overcome with time and good ideas before the interaction is truly natural.  The problem with natural is that it’s different for everyone, which generally forces you to make it unnatural for some group of people.  Also with the limitations of the current technology you will often find yourself making unnatural concessions to make something work.

A great example of this is getting detected by the camera, often referred to in the office as “Doing the Truffle Shuffle”.  Some skeleton SDKs require a pose or gesture before you are detected as an active user.  For example, OpenNI has the “Psi” pose.  Some ask you to wave your hand.  Some, like Kinect, just work.  Even so, many games layer their own logic on top about when a user can join; it is highly varied and currently unnatural because there isn’t one consistent way yet.

Another good example of this is turning.  If I asked you to turn, how would you do it?

Q: Would you naturally turn, away from the TV?
A: No, then you couldn’t see the TV.

Q: So would you turn your whole body and continue to face the TV, or just your shoulders?
A: If you turn your body naturally you’ll occlude half of your body, making it harder to detect other actions simultaneously (Walk + Turn).  Also, many skeleton SDKs have varying levels of success tracking shoulder angle, and occluding a shoulder usually causes the tracked shoulder joints to move around.

Q: What about if we let the hands determine turning, moving them left to move left, right to move right?
A: Good in some contexts, like skiing and horseback riding.  It’s very unnatural when walking around. It also prevents you from using the hands to do other things at the same time.  It’s also very hard to hold for long periods of time if you have to keep them there.

Q: How about leaning left to turn left, leaning right to turn right?
A: It’s great from a technological standpoint. It won’t ever occlude any part of the body.  Very easy to do for all users.  Very easy to hold for long durations.  Can be combined with many other actions.  However, it’s completely unnatural.

The best advice I can give here is to get people to test out your ideas.  I can’t tell you how many times I thought I had solved a problem only to see a tester or coworker break it almost immediately.  If you can help it, find new people to try out the game.  We refer to them as untrained users around the office.  For these systems you’ll find that over time the system trains you back.  You learn just the right movements without thinking about it, which will lead to a false sense of improvement in your gesture detection code.

I haven’t seen it happen yet, but I suspect many motion games in the future will actually ship with multiple ways of handling the same input and let users select the one they prefer, in the same way games today offer inverted controls and different control schemes.

Noise

The cameras are not perfect, and they’re mapping a physical space to some finite number of pixels.  Surfaces that poorly reflect infrared, other infrared sources (like the sun) and even the manner in which the cameras define a contiguous surface can cause variations from frame to frame, leading to lots of jaggy, shifting edges on objects.  This jitter changes the apparent volume of an object, and thus the calculated positions of bones in a skeleton shift too.

So you’ve got to find a way to smooth out the data without adding lag to the propagation of player movements onto the character.  The best way we’ve found is a predictive filter.  Such filters average old frames in with the current frame’s data, but simultaneously predict N (usually 1) frames forward in time.  The only drawback is that they end up overshooting and undershooting the actual curve of motion, because they predict that the motion will continue in the same direction.  Luckily this largely goes unnoticed by users.
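
One common filter of this kind is Holt double exponential smoothing, sketched below for a single joint coordinate.  Treat it as an illustration of the idea rather than the exact filter we use; the `alpha` and `beta` weights are arbitrary starting values.

```python
class PredictiveFilter:
    """Holt double exponential smoothing with forward prediction."""

    def __init__(self, alpha=0.5, beta=0.5, predict_frames=1):
        self.alpha = alpha                    # weight of the newest sample in the level
        self.beta = beta                      # weight of the newest slope in the trend
        self.predict_frames = predict_frames  # how many frames to look ahead
        self.level = None                     # smoothed position
        self.trend = 0.0                      # smoothed per-frame velocity

    def update(self, sample):
        """Feed one raw sample and return the smoothed, predicted value."""
        if self.level is None:
            self.level = sample
            return sample
        previous = self.level
        # Blend the new sample with where the old estimate expected to be.
        self.level = self.alpha * sample + (1 - self.alpha) * (previous + self.trend)
        # Blend the newly observed slope with the old trend.
        self.trend = self.beta * (self.level - previous) + (1 - self.beta) * self.trend
        # Extrapolate forward to hide the lag introduced by averaging.
        return self.level + self.predict_frames * self.trend

# A hand moving steadily, then stopping at 4: the output runs past the stop.
hand_filter = PredictiveFilter()
smoothed = [hand_filter.update(x) for x in [0, 1, 2, 3, 4, 4, 4]]
```

Once the hand stops, the trend term keeps pushing the estimate past the true position for a few frames, which is the overshoot described above.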

Generally Avoid

The amount you should avoid each of these varies across cameras and skeleton SDKs, but generally speaking this is my own list of things you should try to avoid.

  • Small Motions – Detecting them is very difficult; they are very easy to confuse with noise.
  • Holding hard poses – It’s hard to hold your arms out for extended periods of time.
  • Motions near the body – Occlusion problems, bone loss.
  • Fast motions – Most of the consumer grade depth cameras right now run at 30 FPS.  It’s very easy to move faster than the segmentation / skeleton prediction code is willing to bet you’ve moved, in which case it will happily ignore your motion.
  • Extreme poses – Poses most people would have trouble making.  Not just because people have trouble making them, but because most of the skeleton SDKs are not trained for unusual body positions.
  • Sitting – It is generally not handled well across skeleton SDKs.  The overall skeleton becomes a lot less trustworthy.

That’s Normal Right?

All the skeleton SDKs I’ve used so far generally return nothing other than the rawest of the raw bone positions.  This is generally a good thing; you wouldn’t want them to hide the raw data from you.  However, it tends to result in moments when your hand penetrates your chest, your knee flips backwards, or your leg ends up behind your back.

So it becomes important to try to avoid these events by using joint constraints.  Even though the skeleton SDKs usually report bone confidence numbers, that confidence is not based on how a normal human can move.  It’s based on whether they can clearly see something they think is a body part.  If so, they will report things like being 100% confident that your leg has driven itself up into your chest.
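
A minimal sketch of one such constraint: check the angle at a joint and fall back to the last plausible position when it is out of range.  The 40 degree hip limit and the function names are assumptions for illustration, not values from any SDK.

```python
import math

def joint_angle(a, b, c):
    """Interior angle in degrees at point b, formed by the segments b-a and b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0.0:
        return 0.0  # degenerate bone; treated as implausible by the caller
    cosine = sum(x * y for x, y in zip(v1, v2)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def constrain_child(parent, joint, child, last_good_child, min_deg, max_deg=180.0):
    """Return `child` if the angle at `joint` is within limits, else the last good value."""
    if min_deg <= joint_angle(parent, joint, child) <= max_deg:
        return child
    return last_good_child

# Assumed limit: the thigh cannot fold closer than 40 degrees to the torso.
HIP_MIN_DEG = 40.0
neck, hip = (0.0, 1.5, 2.0), (0.0, 1.0, 2.0)
standing_knee = (0.0, 0.5, 2.0)
knee_in_chest = (0.0, 1.3, 2.0)  # what the SDK might report with high confidence

knee = constrain_child(neck, hip, standing_knee, standing_knee, HIP_MIN_DEG)
knee = constrain_child(neck, hip, knee_in_chest, knee, HIP_MIN_DEG)  # rejected
```

Holding the last good value is the simplest fallback; clamping the bone to the nearest legal angle would look smoother but takes more math.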

Time

Timing is very difficult.  The user has to predict how long he is going to take to move, while at the same time accounting for how long the avatar will take to move, plus how long the gesture detection will take to detect his action.  This makes it very hard for him to predict when he has to jump, duck or move to the side.

In these situations, feedback that he has done the right thing, as well as how long he has left to do it, can be important.  One handy trick when there is too little time to play back an animation is Bullet-Time.  Imagine a player running and jumping hurdles.  There’s an unknown zone where, once the player has entered it, there is not enough time left to play back the jump animation without it looking bizarrely sped up.  However, with Bullet-Time, if you detect the gesture just in time, you can slow down time long enough to play back the animation and also indicate to the user, “Hey, you almost missed that one”.  Bullet-Time is also handy for just giving the user more time to make a split second decision; as soon as they’ve made it, you speed back up.
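
As a sketch of the hurdle case, the world time scale can be derived from how much game time is left before the obstacle compared with how long the animation needs.  The function name and the `min_scale` floor are assumptions for illustration.

```python
def bullet_time_scale(time_to_obstacle, animation_length, min_scale=0.25):
    """World time scale that lets the animation play at normal speed.

    time_to_obstacle: seconds of game time until the avatar reaches the hurdle.
    animation_length: seconds the jump animation needs at normal speed.
    """
    if time_to_obstacle >= animation_length:
        return 1.0  # enough time; no slow-down needed
    # Slow the world so the remaining game time stretches to fit the animation,
    # but never below min_scale so the game does not appear frozen.
    return max(min_scale, time_to_obstacle / animation_length)
```

With a 0.6 second jump animation and the gesture detected 0.3 seconds before the hurdle, the world runs at half speed until the jump completes.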

Just a Little Bit Closer…

Depth perception is another frustrating problem.  Users have really poor perception of how far their avatar is from the objects it can interact with.  Luckily there are many ways around this problem.

  • Depth Cues – shadows do a great job of helping to show distance as you get closer to an object
  • UI Visual Cues – Visual feedback that you can now interact with the object is important.  If I’m playing a volleyball game, changing the halo around the ball from red to green to indicate I can now jump and hit it can be valuable feedback, because it’s hard estimating how high my character can jump, or when they can jump.
  • Camera angle is everything.  Having the right angle to the object can make it much easier to tell depth.
  • Audio Cues – I don’t see these get used very often, but sound is a great way to indicate action is required, or success or failure on the user’s part.