I'll skip over the chronology behind my current implementation. Assume I've abandoned/refined various approaches because they couldn't provide "acceptable" performance.
Rubine-style (Rubinesque ;-) ) recognizers are speedy and simple to implement. But, they seem to produce bizarre results with an unforgivably high frequency! They seem to fall below the "Make everything as simple as possible, but no simpler" yardstick. I spent a *lot* of time reviewing raw data trying to locate elusive "bugs" in that recognizer only to discover that it was correctly misrecognizing that data!
[I suspect a problem here is that the feature sets distill too much from the gestures -- esp. much of the temporal relationships -- and leave too many potential unresolved conflicts in the data set. So, unexpected characteristics of the input have disproportionate influence on the results.]

DTW recognizers are incredibly expensive. And, enforcing realistic constraints on the point matching/warping is rocket science/mysticism. You can't "prove" that the criteria you have adopted are "correct" -- just that they *seem* to improve performance FOR YOUR PARTICULAR DATA AND GESTURE SET!

OTOH, both of the above benefit from the fact that as the size of the raw data increases ("long"/complex gestures and/or slowly drawn ones), there is more "real time" available for their recognition (because a gesture can't be expected to be recognizable until it has been completely issued!)
[OTOOH, the increased size of that data set means you had better be processing whatever you can *as* you collect it, else you will be taxed trying to crunch it all in the milliseconds following the gesture's issuance!]

To make things affordable (esp. to hit a ~100ms recognition window), you have to seriously reduce the data that you have to process. OTOH, you want enough incoming data to support fine detail in digitizing the motion as well as a wide range of "drawing speeds".
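To make the "process as you collect" idea concrete, here's one way the incoming point stream might be thinned on the fly (a hedged sketch -- the actual reduction scheme isn't specified above; `decimate` and the minimum-spacing rule are illustrative assumptions):

```python
import math

def decimate(points, min_spacing):
    """Drop input samples that land closer than min_spacing to the
    last *kept* point -- a cheap, on-the-fly data reduction that can
    run as samples arrive instead of in a post-pass."""
    kept = [points[0]]
    for p in points[1:]:
        kx, ky = kept[-1]
        if math.hypot(p[0] - kx, p[1] - ky) >= min_spacing:
            kept.append(p)
    if kept[-1] != points[-1]:
        kept.append(points[-1])   # always preserve the end point
    return kept
```

A fast drag yields widely spaced samples that all survive; a slow drag yields dense samples that collapse, so the retained detail is roughly speed-independent.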
The first win is to design the recognizer such that it is constrained as to the set of valid gestures that it must consider at any given time. This reduces the number of potential match candidates that must be (numerically) evaluated. [It does so at the risk of being "confused" by a "syntax error" -- a gesture issued illegally!]
So, I pass a set of legal gestures to the recognizer prior to each possible gesture issuance. Since one gesture's (successful?) recognition can define the set of valid subsequent gestures, this means I must be able to recognize gesture #1, pass it to the UI state machine and determine the set of valid subsequent gestures *prior* to gesture #2 being STARTED.
[I could conceivably support a form of "type-ahead" but it would be impossible to design with deterministic support for that potential overloading, as you wouldn't be able to predict whether future processing requirements would INCREASE beyond what you are currently UNABLE to handle! So, best to plan on handling things as they happen.]

A key aspect of this gesture set is the omnipresence of a "none-of-the-above" result. The recognizer can't assume that the input data set *is* one of these gestures (i.e., it's not a question of "pick the best fit" but, rather, "pick the *likely* fit *if* likely").
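The "legal set plus none-of-the-above" contract might be sketched as follows (all names here are hypothetical, and `mean_dist` is a trivial stand-in for whatever difference metric is actually used):

```python
import math
from typing import Optional

Point = tuple[float, float]

def mean_dist(a: list[Point], b: list[Point]) -> float:
    """Mean point-to-point distance (assumes equal-length lists)."""
    return sum(math.hypot(p[0] - q[0], p[1] - q[1])
               for p, q in zip(a, b)) / len(a)

def recognize(stroke: list[Point],
              legal: dict[str, list[Point]],
              threshold: float) -> Optional[str]:
    """Consider *only* the currently legal gestures; return None
    ("none of the above") when even the best fit isn't likely."""
    best_name, best_score = None, math.inf
    for name, template in legal.items():
        score = mean_dist(stroke, template)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None
```

The UI state machine rebuilds `legal` after each recognition, which is what imposes the "recognize #1 before #2 starts" deadline.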
For each possible gesture result, I note key characteristics of that "model/template" gesture. These include a formal definition of the gesture (i.e., a mathematical representation of the "path" traversed in its issuance); the "event" that should be signalled on its recognition; constraints placed on its recognition (e.g., the minimum time required for a non-resident of Krypton to draw the gesture); and a summary of other "features" of the gesture.
The goal of much of this data is to quickly exclude candidates that are not likely matches for a particular input data set. E.g., drawing an "ampersand" takes considerably longer than drawing a 5-pointed star -- even though both have comparable total stroke lengths! And, drawing *either* of those takes infinitely longer than drawing a simple "stroke" (dash, etc.). By examining
*summaries* of the raw data set, these criteria let the recognizer discount certain possibilities (or, at least, decide they stand much less chance of being successful matches, so the recognizer's attention is better directed elsewhere).

Models/templates are (relatively) static. So, any computationally significant "characteristics" of those models can be computed ahead of time. This reduces the time required in the recognizer *if* these can be *useful* characteristics!

Borrowing from the Rubine approach, I store many of the "features" of the templates. Since I have a representation of the gesture's (ideal) "path", computing these features is fairly easy (it *could* be done at compile time *or* as part of an initialization stage at run-time).
So, I store things like total stroke length (equivalent to the "amount of ink" that would be used in writing said gesture), total angle (for a circle/square/triangle this would be 360 deg; for an 'L', it would be 90, etc.), aspect ratio ('1' for a circle; '0' for a hyphen; 'infinity' for a vertical line, etc.), etc.
The "value" of these features is that they are economical to compute *and* can be computed (for the raw input data set)
*as* the data is being generated. Contrast this with other aspects of the recognition process that cannot be performed until AFTER the input data is "complete".

Using these features, I prioritize the templates that I examine in further attempts to discount unlikely candidates and narrow in on the "actual" result.
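Such running summaries can be maintained incrementally, something like this (an illustrative sketch; the features follow the ones listed above -- total ink, total turn angle, bounding-box aspect -- but the class and point format are assumptions):

```python
import math

class StrokeFeatures:
    """Running summaries updated *as* points arrive (no post-pass):
    total stroke length, total (signed) turn angle, bounding box."""
    def __init__(self):
        self.length = 0.0          # total "ink"
        self.total_angle = 0.0     # signed, in degrees
        self.pts = []
        self.min_x = self.min_y = math.inf
        self.max_x = self.max_y = -math.inf

    def add(self, x, y):
        if self.pts:
            px, py = self.pts[-1]
            self.length += math.hypot(x - px, y - py)
            if len(self.pts) >= 2:
                qx, qy = self.pts[-2]
                a1 = math.atan2(py - qy, px - qx)
                a2 = math.atan2(y - py, x - px)
                turn = math.degrees(a2 - a1)
                turn = (turn + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
                self.total_angle += turn
        self.pts.append((x, y))
        self.min_x, self.max_x = min(self.min_x, x), max(self.max_x, x)
        self.min_y, self.max_y = min(self.min_y, y), max(self.max_y, y)

    @property
    def aspect(self):
        """Height/width: 1 for a circle, 0 for a hyphen, inf for a
        vertical line -- matching the convention used above."""
        w, h = self.max_x - self.min_x, self.max_y - self.min_y
        return math.inf if w == 0 else h / w
```

Each `add()` is O(1), so the cost is amortized over the gesture's issuance rather than piled up after it.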
For templates that look "highly likely" to be worth considering as potential matches for the input data set, I subdivide the mathematical representation of the template's "path" to yield a number of discrete data points proportional to the number reported in the input data set (which has been manipulated prior to this).
This is an attempt to further constrain the DTW algorithm so that it has *less* flexibility in producing a correspondence between input and template data points.
Roughly speaking, DTW maps the start point of the input data to the start point of the template. The same applies to the "end" points. Since DTW tolerates *differences* in the numbers of points in the input and template data sets, everything between start and end is "elastic".
I'm trying to make that elastic more *rigid*. I.e., the point "in the middle" of the template's path "should" map to the point "in the middle" of the path drawn by the input data set. And, similarly, for the rest of the points in each set.
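That rigid point-for-point correspondence can be obtained by resampling each candidate path evenly by arc length. A sketch, using a flattened polyline as a stand-in for the bezier template representation:

```python
import math

def resample(path, n):
    """Resample a polyline to n points spaced evenly by arc length,
    so point i of the input can be compared directly to point i of
    the template (the ~1:1 correspondence described above)."""
    # cumulative arc length at each vertex
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # advance to the segment containing the target arc length
        while j < len(dists) - 2 and dists[j + 1] < target:
            j += 1
        seg = dists[j + 1] - dists[j]
        t = 0.0 if seg == 0 else (target - dists[j]) / seg
        (x0, y0), (x1, y1) = path[j], path[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

With beziers, the equivalent step is subdividing the curve and accumulating chord lengths before the same arc-length walk; that subdivision is likely where much of the representation-specific cost lives.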
Ideally, I end up with N points in each of the template and input data sets (N is defined *by* the input data set and the corresponding points in each template candidate are *computed* to satisfy this N!). Then, I compute the "difference" between the input and template curves based on these points. Of course, the goal is to find the smallest difference/tightest fit.

Some (bogus!) threshold differentiates between "good enough" and "no cigar" in the final decision. (Note that I currently pick *the* best "good enough" candidate as "the" result. I probably need to return a *set* of likely results with some sort of confidence ratings and let the application layer cast the final decision -- including, possibly, asking the user for clarification!)
In summary, I use inexpensive computations to try to discount as many candidates as possible. (Rubinesque) Features are "free" for the numerous templates and inexpensive for the
*sole* input data set. The explosive computational needs of DTW are constrained by ensuring a ~1:1 correspondence of input and template data points. The "difference" operator is a bit more expensive than the simple Euclidean distance operator used in DTW (e.g., I actually measure the *area* between the "curves"). But, the cost of ensuring the 1:1 mapping is *not* borne by these other algorithms (penalty mine). I believe much of that cost is a consequence of the way in which I represent the template "paths" (beziers).

The truly unfortunate aspect of all this is that it still doesn't "guarantee" that the chosen template is "correct"! (Perhaps that simply isn't possible when trying to do this sort of pattern recognition?)
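The area-between-curves difference operator could be approximated as follows once the 1:1 correspondence exists (a crude sketch: each pair of matched segments spans a quadrilateral whose shoelace area is accumulated; self-intersecting quads are handled naively):

```python
def poly_area(quad):
    """Shoelace formula for the (absolute) area of a polygon."""
    s = 0.0
    for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1]):
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

def area_between(a, b):
    """Approximate the area between two curves that already share a
    1:1 point correspondence: sum the quadrilaterals spanned by each
    consecutive pair of matched points."""
    assert len(a) == len(b)
    return sum(poly_area([a[i], a[i + 1], b[i + 1], b[i]])
               for i in range(len(a) - 1))
```

Identical curves score 0, and the score grows with how far (and for how long) the input strays from the template -- unlike a pure per-point Euclidean sum, short wide excursions and long shallow ones are weighted by the ink area they enclose.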
So, the real questions are:
- anyone gone down this road before (good/bad experiences)
- other ways to represent "paths" (that don't have explosive costs associated with their representations) that are easier to "subdivide"
- easier ways of applying transforms (I've glossed over the "post processing" of the input data set) to support scaling/translation/rotation
- other BFM that might be applicable to this domain
- any existing products that are worth evaluating to get a feel for how well *they* perform (e.g., most writing recognizers have too high a failure rate to be applied this way -- think of the car driving analogy!)
- other "less arbitrary" criteria to make the final "no cigar" decision on a matching template
[I suspect I've seen most of the classic and current literature on this subject -- in some form. And, most of the available FOSS-ish implementations are far LESS capable than what I've done.]

Thx,