Gesture Recognition 1 -- The Back Story

Hi,

[APOLOGIES in advance for the length of the post -- but there seems to be a lot of ambiguity in this field, so I try to define the terms that I use. I'll post this in two parts: this first part outlining the problem domain and the second describing the implementation.] [[ *PLEASE* elide judiciously when replying. None of us need to read this entire missive again just because you're too lazy to edit it for your reply :< ]]

I'm tweaking a gesture recognizer that I wrote and still not happy with the results.

In this context, "gesture recognizer" is akin to "pen recognizer" (though not entirely). Specifically, the gesture is (currently) "issued" by the fingertip (on a single-point touch pad) without the use of a stylus, etc. In the future, this may migrate to a camera- or accelerometer-based recognizer.

It's an "on-line" recognizer so it has access to the temporal aspects of the gesture (vs. an off-line recognizer that only sees a static "afterimage"). I.e., I can "watch" the gesture as it is being "issued".
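
For concreteness, the raw material looks something like the following (a minimal sketch only -- the names, types and sizes are mine for illustration, not the actual code):

#include <stddef.h>
#include <stdint.h>

/* One report from the (single-point) touch pad. */
typedef struct {
    int16_t  x, y;      /* pad coordinates of the fingertip */
    uint32_t t_ms;      /* timestamp -- this is what makes it "on-line" */
} sample_t;

/* A gesture in progress. */
typedef struct {
    sample_t pts[256];  /* arbitrary cap for the sketch */
    size_t   count;
} stroke_t;

/* Called for each new report while the finger is down.  An off-line
 * recognizer would only ever see the finished stroke; an on-line one
 * can "watch" it accumulate and start matching before lift-off.
 */
static void stroke_append(stroke_t *s, int16_t x, int16_t y, uint32_t t_ms)
{
    if (s->count < sizeof s->pts / sizeof s->pts[0]) {
        s->pts[s->count].x    = x;
        s->pts[s->count].y    = y;
        s->pts[s->count].t_ms = t_ms;
        s->count++;
    }
}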

The key point(s) to take away are:

- no need for explicit "training" (user-independent)

- rotation/scale invariant (subject to the gesture set; see the sketch following this list)

- "real-time" interaction

- it's not (conventional) "writing" that is taking place

- a single point is traced through space

- no "afterimages" of gestures are present (i.e., no "ink")
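
To make the rotation/scale-invariance bullet concrete, here is the sort of normalization a template matcher typically performs before comparing anything (in the style of the preprocessing used by the well-known "$1" unistroke recognizer -- a sketch under my own assumptions, not my actual code; the point count, names, etc. are arbitrary). Scale invariance falls out of normalizing the bounding box; rotation invariance (where the gesture set permits it -- 'u' vs. 'n' obviously can't both survive it) comes from rotating about the centroid:

#include <math.h>

#define NPTS 64                 /* arbitrary resample count */

typedef struct { float x, y; } pt_t;

/* Translate the (already resampled) stroke so its centroid sits at the
 * origin, then scale uniformly so the larger bounding-box dimension is 1.
 * After this, a large 'O' and a small 'O' present (nearly) the same shape
 * to the matcher, regardless of where on the pad they were drawn.
 */
static void normalize(pt_t p[NPTS])
{
    float cx = 0, cy = 0;
    float minx = p[0].x, maxx = p[0].x, miny = p[0].y, maxy = p[0].y;

    for (int i = 0; i < NPTS; i++) {
        cx += p[i].x;
        cy += p[i].y;
        if (p[i].x < minx) minx = p[i].x;
        if (p[i].x > maxx) maxx = p[i].x;
        if (p[i].y < miny) miny = p[i].y;
        if (p[i].y > maxy) maxy = p[i].y;
    }
    cx /= NPTS;
    cy /= NPTS;

    float w = maxx - minx, h = maxy - miny;
    float s = (w > h) ? w : h;
    if (s == 0.0f)
        s = 1.0f;               /* degenerate "tap" gesture */

    for (int i = 0; i < NPTS; i++) {
        p[i].x = (p[i].x - cx) / s;
        p[i].y = (p[i].y - cy) / s;
    }
}

/* For rotation invariance, rotate every point about the origin by the
 * negative of the angle from the centroid to the first point (the
 * "indicative angle"), so all instances of a gesture start out facing
 * the same direction.  Assumes normalize() has already run.
 */
static void derotate(pt_t p[NPTS])
{
    float a = atan2f(p[0].y, p[0].x);   /* centroid is at the origin now */
    float c = cosf(-a), sn = sinf(-a);

    for (int i = 0; i < NPTS; i++) {
        float x = p[i].x, y = p[i].y;
        p[i].x = x * c - y * sn;
        p[i].y = x * sn + y * c;
    }
}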

The last point in that list (no "ink") bears further emphasis: the user has no obvious means of reviewing the gesture issued. E.g., with a pen interface, you can "see" what you "wrote". More importantly, you can see what the MACHINE thinks you wrote! (i.e., if you *think* you drew an 'O' but the resulting "image" looks more like a 'C', then you know the machine didn't "see" what you intended it to see! So, if it ends up doing something "unexpected", you know *why*!)

Finally, the interface is used to issue *commands*, not for "data entry" (though this is a small fib). As such, the user has no opportunity to confirm/deny the results of the recognizer before they are acted upon. E.g., in a pen interface, if something is misrecognized, *you* can detect that and discard/abort the entry. By contrast, here, the entry is *acted upon* as soon as it is recognized!

In *practice*, to further clarify these issues, my prototype is a small touchpad (~3" dia) mounted on the *lapel* of a jacket. As such, the user can't "see" the gesture he is issuing. Nor any "afterimage" (had there been "ink" involved). Nor would it be a particularly convenient way of "writing".

OTOH, it only requires *one* hand for operation *and* leaves your eyes free for other activities. (IMO, two of the stupidest aspects of Apple's products are that they require *two* hands *and* two EYES to operate! Try using one while running down the street or SAFELY driving a car! :> )

Recognizing a small set of gestures is relatively easy. But, as the gesture set increases, the potential for ambiguities quickly increases.

I've designed my gestures to try to minimize this. And, the user interface has been designed with an awareness of the gesture input mechanism and its needs/constraints. For example, the UI constrains the range of valid inputs at any given time so that the gesture recognizer need not recognize the entire range of "potential" gestures at all times. (This is also a profound efficiency hack!) It also tries to avoid having similar gestures in the same input set (e.g., 'O' vs. '0') to increase the "distance" between candidates. (Eventually, I would like to provide a run-time mechanism that lets the application evaluate the "orthogonality" of the input set on-the-fly.) The emphasis in this approach is to dynamically trade complexity for speed/reliability.
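
The sort of run-time "orthogonality" check alluded to above might look something like this (again, a sketch under my own assumptions -- the distance metric, names and threshold are all illustrative): score every pair of templates in the *currently active* gesture set and flag any pair that sits too close together, so the application can reshuffle the set (or at least know where misrecognitions are likely):

#include <math.h>
#include <stdio.h>

#define NPTS 64

typedef struct { float x, y; } pt_t;

typedef struct {
    const char *name;
    pt_t        pts[NPTS];      /* already resampled + normalized */
} template_t;

/* Mean point-to-point distance between two normalized templates;
 * a small distance means easily-confused gestures.
 */
static float tmpl_distance(const template_t *a, const template_t *b)
{
    float d = 0.0f;

    for (int i = 0; i < NPTS; i++) {
        float dx = a->pts[i].x - b->pts[i].x;
        float dy = a->pts[i].y - b->pts[i].y;
        d += sqrtf(dx * dx + dy * dy);
    }
    return d / NPTS;
}

/* Let the application evaluate the active set *before* the user trips
 * over a confusable pair (e.g., 'O' vs. '0').
 */
static void check_orthogonality(const template_t set[], int n, float min_d)
{
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (tmpl_distance(&set[i], &set[j]) < min_d)
                printf("warning: '%s' and '%s' are too similar\n",
                       set[i].name, set[j].name);
}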

For example, when used for data entry (the "fib" alluded to above), the size of the gesture set increases dramatically. E.g., even simple numeric entry requires a dozen "extra" gestures! OTOH, data entry tasks tend to be more "focused" than command-oriented activities -- so the user can be expected to be more careful in issuing those gestures. Also, data entry carries an expectation of "review" prior to acceptance -- unlike "commands", where you don't want to be constantly prompting the user for confirmation: "I have detected the EMERGENCY STOP gesture. Do you really mean to effect an emergency stop? Ooops! Never mind..."

As an (insane) example of the potential application for such an interface, one *should* be able to drive a *car* using it!

[*Think* about the consequences of that. You surely don't want to be requiring confirmation of every issued gesture -- the interface would be *way* too clumsy AND sluggish! You don't want the "driver" watching a display for feedback to see what the recognizer's decisions are (or, if it might have "missed" part of a gesture). And, you surely don't want the recognizer to misrecognize "turn left" for "come to a complete stop". Yet, if the driver can spare the attention, he should be able to "enter" a speed setting -- "45" -- for the cruise control; or, dial a radio station; etc.]

Of course, the other big advantage of such an interface is that it can be "wide" (support lots of virtual buttons) as well as configurable -- in a small size/cost.

Don Y