I'll skip over the chronology behind my current implementation. Assume I've abandoned/refined various approaches because they couldn't provide "acceptable" performance.
Rubine-style (Rubinesque ;-) ) recognizers are speedy and simple to implement. But, they seem to produce bizarre results with an unforgivably high frequency! They seem to fall below the "Make everything as simple as possible, but no simpler" yardstick. I spent a *lot* of time reviewing raw data trying to locate elusive "bugs" in that recognizer only to discover that it was correctly misrecognizing that data!
[I suspect a problem here is that the feature sets distill too much from the gestures -- esp. much of the temporal relationships -- and leave too many potential unresolved conflicts in the data set. So, unexpected characteristics of the input have disproportionate influence on the results.]

DTW recognizers are incredibly expensive. And, enforcing realistic constraints on the point matching/warping is rocket science/mysticism. You can't "prove" that the criteria you have adopted are "correct" -- just that they *seem* to improve performance FOR YOUR PARTICULAR DATA AND GESTURE SET!

OTOH, both of the above benefit from the fact that as the size of the raw data increases ("long"/complex gestures and/or slowly drawn ones), there is more "real time" available for their recognition (because a gesture can't be expected to be recognizable until it has been completely issued!)
[OTOOH, the increased size of that data set means you had better be processing whatever you can *as* you collect it, else you will be taxed trying to crunch it all in the milliseconds following the gesture's issuance!]

To make things affordable (esp. to hit a ~100ms recognition window), you have to seriously reduce the data that you have to process. OTOH, you want enough incoming data to support fine detail in digitizing the motion as well as a wide range of "drawing speeds".
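To make the "process as you collect" idea concrete, here's one way the incoming point stream might be thinned on the fly (a hedged sketch -- the actual reduction scheme isn't specified above; `decimate` and the minimum-spacing rule are illustrative assumptions):

```python
import math

def decimate(points, min_spacing):
    """Drop input samples that land closer than min_spacing to the
    last *kept* point -- a cheap, on-the-fly data reduction that can
    run as samples arrive instead of in a post-pass."""
    kept = [points[0]]
    for p in points[1:]:
        kx, ky = kept[-1]
        if math.hypot(p[0] - kx, p[1] - ky) >= min_spacing:
            kept.append(p)
    if kept[-1] != points[-1]:
        kept.append(points[-1])   # always preserve the end point
    return kept
```

A fast drag yields widely spaced samples that all survive; a slow drag yields dense samples that collapse, so the retained detail is roughly speed-independent.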
The first win is to design the recognizer such that it is constrained as to the set of valid gestures that it must consider at any given time. This reduces the number of potential match candidates that must be (numerically) evaluated. [It does so at the risk of being "confused" by a "syntax error" -- a gesture issued illegally!]
So, I pass a set of legal gestures to the recognizer prior to each possible gesture issuance. Since one gesture's (successful?) recognition can define the set of valid subsequent gestures, this means I must be able to recognize gesture #1, pass it to the UI state machine and determine the set of valid subsequent gestures *prior* to gesture #2 being STARTED.
[I could conceivably support a form of "type-ahead" but it would be impossible to design with deterministic support for that potential overloading, as you wouldn't be able to predict whether future processing requirements would INCREASE beyond what you are currently UNABLE to handle! So, best to plan on handling things as they happen.]

A key aspect of this gesture set is the omnipresence of a "none-of-the-above" result. The recognizer can't assume that the input data set *is* one of these gestures (i.e., it's not a question of "pick the best fit" but, rather, "pick the *likely* fit *if* likely").
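The "legal set plus none-of-the-above" contract might be sketched as follows (all names here are hypothetical, and `mean_dist` is a trivial stand-in for whatever difference metric is actually used):

```python
import math
from typing import Optional

Point = tuple[float, float]

def mean_dist(a: list[Point], b: list[Point]) -> float:
    """Mean point-to-point distance (assumes equal-length lists)."""
    return sum(math.hypot(p[0] - q[0], p[1] - q[1])
               for p, q in zip(a, b)) / len(a)

def recognize(stroke: list[Point],
              legal: dict[str, list[Point]],
              threshold: float) -> Optional[str]:
    """Consider *only* the currently legal gestures; return None
    ("none of the above") when even the best fit isn't likely."""
    best_name, best_score = None, math.inf
    for name, template in legal.items():
        score = mean_dist(stroke, template)
        if score < best_score:
            best_name, best_score = name, score
    return best_name if best_score <= threshold else None
```

The UI state machine rebuilds `legal` after each recognition, which is what imposes the "recognize #1 before #2 starts" deadline.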
For each possible gesture result, I note key characteristics of that "model/template" gesture. These include a formal definition of the gesture (i.e., a mathematical representation of the "path" traversed in its issuance); the "event" that should be signalled on its recognition; constraints placed on its recognition (e.g., the minimum time required for a non-resident of Krypton to draw the gesture); and a summary of other "features" of the gesture.
The goal of much of this data is to quickly exclude candidates that are not likely matches for a particular input data set. E.g., drawing an "ampersand" takes considerably longer than drawing a 5-pointed star -- even though both have comparable total stroke lengths! And, drawing *either* of those takes infinitely longer than drawing a simple "stroke" (dash, etc.). By examining
*summaries* of the raw data set, these criteria let the recognizer discount certain possibilities (or, at least, decide they stand much less chance of being successful matches, so the recognizer's attention is better directed elsewhere).

Models/templates are (relatively) static. So, any computationally significant "characteristics" of those models can be computed ahead of time. This reduces the time required in the recognizer *if* these can be *useful* characteristics!

Borrowing from the Rubine approach, I store many of the "features" of the templates. Since I have a representation of the gesture's (ideal) "path", computing these features is fairly easy (it *could* be done at compile time *or* as part of an initialization stage at run-time).
So, I store things like total stroke length (equivalent to the "amount of ink" that would be used in writing said gesture), total angle (for a circle/square/triangle this would be 360 deg; for an 'L', it would be 90, etc.), aspect ratio ('1' for a circle; '0' for a hyphen; 'infinity' for a vertical line, etc.), etc.
The "value" of these features is that they are economical to compute *and* can be computed (for the raw input data set)
*as* the data is being generated. Contrast this with other aspects of the recognition process that cannot be performed until AFTER the input data is "complete".

Using these features, I prioritize the templates that I examine in further attempts to discount unlikely candidates and narrow in on the "actual" result.
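Such running summaries can be maintained incrementally, something like this (an illustrative sketch; the features follow the ones listed above -- total ink, total turn angle, bounding-box aspect -- but the class and point format are assumptions):

```python
import math

class StrokeFeatures:
    """Running summaries updated *as* points arrive (no post-pass):
    total stroke length, total (signed) turn angle, bounding box."""
    def __init__(self):
        self.length = 0.0          # total "ink"
        self.total_angle = 0.0     # signed, in degrees
        self.pts = []
        self.min_x = self.min_y = math.inf
        self.max_x = self.max_y = -math.inf

    def add(self, x, y):
        if self.pts:
            px, py = self.pts[-1]
            self.length += math.hypot(x - px, y - py)
            if len(self.pts) >= 2:
                qx, qy = self.pts[-2]
                a1 = math.atan2(py - qy, px - qx)
                a2 = math.atan2(y - py, x - px)
                turn = math.degrees(a2 - a1)
                turn = (turn + 180.0) % 360.0 - 180.0  # wrap to (-180, 180]
                self.total_angle += turn
        self.pts.append((x, y))
        self.min_x, self.max_x = min(self.min_x, x), max(self.max_x, x)
        self.min_y, self.max_y = min(self.min_y, y), max(self.max_y, y)

    @property
    def aspect(self):
        """Height/width: 1 for a circle, 0 for a hyphen, inf for a
        vertical line -- matching the convention used above."""
        w, h = self.max_x - self.min_x, self.max_y - self.min_y
        return math.inf if w == 0 else h / w
```

Each `add()` is O(1), so the cost is amortized over the gesture's issuance rather than piled up after it.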
For templates that look "highly likely" to be worth considering as potential matches for the input data set, I subdivide the mathematical representation of the template's "path" to yield a number of discrete data points proportional to the number reported in the input data set (which has been manipulated prior to this).
This is an attempt to further constrain the DTW algorithm so that it has *less* flexibility in producing a correspondence between input and template data points.
Roughly speaking, DTW maps the start point of the input data to the start point of the template. The same applies to the "end" points. Since DTW tolerates *differences* in the numbers of points in the input and template data sets, everything between start and end is "elastic".
I'm trying to make that elastic more *rigid*. I.e., the point "in the middle" of the template's path "should" map to the point "in the middle" of the path drawn by the input data set. And, similarly, for the rest of the points in each set.
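That rigid point-for-point correspondence can be obtained by resampling each candidate path evenly by arc length. A sketch, using a flattened polyline as a stand-in for the bezier template representation:

```python
import math

def resample(path, n):
    """Resample a polyline to n points spaced evenly by arc length,
    so point i of the input can be compared directly to point i of
    the template (the ~1:1 correspondence described above)."""
    # cumulative arc length at each vertex
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    out, j = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # advance to the segment containing the target arc length
        while j < len(dists) - 2 and dists[j + 1] < target:
            j += 1
        seg = dists[j + 1] - dists[j]
        t = 0.0 if seg == 0 else (target - dists[j]) / seg
        (x0, y0), (x1, y1) = path[j], path[j + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

With beziers, the equivalent step is subdividing the curve and accumulating chord lengths before the same arc-length walk; that subdivision is likely where much of the representation-specific cost lives.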
Ideally, I end up with N points in each of the template and input data sets (N is defined *by* the input data set and the corresponding points in each template candidate are *computed* to satisfy this N!). Then, I compute the "difference" between the input and template curves based on these points. Of course, the goal is to find the smallest difference/tightest fit.

Some (bogus!) threshold differentiates between "good enough" and "no cigar" in the final decision. (Note that I currently pick *the* best "good enough" candidate as "the" result. I probably need to return a *set* of likely results with some sort of confidence ratings and let the application layer cast the final decision -- including, possibly, asking the user for clarification!)
In summary, I use inexpensive computations to try to discount as many candidates as possible. (Rubinesque) Features are "free" for the numerous templates and inexpensive for the
*sole* input data set. The explosive computational needs of DTW are constrained by ensuring a ~1:1 correspondence of input and template data points. The "difference" operator is a bit more expensive than the simple Euclidean distance operator used in DTW (e.g., I actually measure the *area* between the "curves"). But, the cost of ensuring the 1:1 mapping is *not* borne by these other algorithms (penalty mine). I believe much of that cost is a consequence of the way in which I represent the template "paths" (beziers).

The truly unfortunate aspect of all this is that it still doesn't "guarantee" that the chosen template is "correct"! (Perhaps that simply isn't possible when trying to do this sort of pattern recognition?)
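The area-between-curves difference operator could be approximated as follows once the 1:1 correspondence exists (a crude sketch: each pair of matched segments spans a quadrilateral whose shoelace area is accumulated; self-intersecting quads are handled naively):

```python
def poly_area(quad):
    """Shoelace formula for the (absolute) area of a polygon."""
    s = 0.0
    for (x0, y0), (x1, y1) in zip(quad, quad[1:] + quad[:1]):
        s += x0 * y1 - x1 * y0
    return abs(s) / 2.0

def area_between(a, b):
    """Approximate the area between two curves that already share a
    1:1 point correspondence: sum the quadrilaterals spanned by each
    consecutive pair of matched points."""
    assert len(a) == len(b)
    return sum(poly_area([a[i], a[i + 1], b[i + 1], b[i]])
               for i in range(len(a) - 1))
```

Identical curves score 0, and the score grows with how far (and for how long) the input strays from the template -- unlike a pure per-point Euclidean sum, short wide excursions and long shallow ones are weighted by the ink area they enclose.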
So, the real questions are:
- anyone gone down this road before (good/bad experiences)
- other ways to represent "paths" (that don't have explosive costs associated with their representations) that are easier to "subdivide"
- easier ways of applying transforms (I've glossed over the "post processing" of the input data set) to support scaling/translation/rotation
- other BFM that might be applicable to this domain
- any existing products that are worth evaluating to get a feel for how well *they* perform (e.g., most writing recognizers have too high a failure rate to be applied this way -- think of the car driving analogy!)
- other "less arbitrary" criteria to make the final "no cigar" decision on a matching template
[I suspect I've seen most of the classic and current literature on this subject -- in some form. And, most of the available FOSS-ish implementations are far LESS capable than what I've done.]

Thx,