Sequence matching

D

Don Y 11 years ago

Hi,

I have a process that emits element_t's at some rate. The number that will be emitted is not known a priori. And, will vary with each invocation.

Another process accumulates these element_t's into an array (list?). When the final element_t has been delivered, this second process will be responsible for determining if the accumulated array matches one or more *const* element_t arrays "hardcoded" into it's implementation.

[That gives a fair bit of leeway as to *how* they are encoded!]

*Or*, any of a number of arrays that have been dynamically created over time.

[Preference is given to the dynamic arrays over the const ones]

I want the second process to deliver a result of the form: {NO_MATCH, PARTIAL_MATCH, EXACT_MATCH} along with a pointer to the const/dynamic array that the input best/first fits.

Assume you have something *like*: result_t compare(element_t input, element_t pattern) that returns: {IS_BEFORE, IS_COINCIDENT, IS_AFTER} to give an ordering to the two element_t's that it examines (i.e., "input result_t pattern")

[E.g., if these were character arrays, then "ABC" would be an exact match if "ABC" was encountered in one of the const/dynamic selections; it would be a partial match if "ABCD" was encountered in the selections; and NO match if "AFRH" was (the only) selection available.]

Assume most invocations will result in some form of "match" (this can have an impact on the choice of algorithm)

You *could* accumulate element_t's as they are emitted and, later, process them as a completed entity -- instead of handling them one at a time (pay me now; pay me later). But, that means assuming the cost of a (dynamic!) structure *just* to accumulate them (in addition to any other structures that the algorithm requires/builds)!

And, of course, handling them as they are emitted can give some performance advantages (e.g., take an early FAIL exit in some cases; and, probably, ALWAYS get a final result quicker -- from the time of the first process's termination)

The choice of encoding scheme for the const/dynamic selections is flexible -- trade space for processing time, etc. (this leaves a LOT of options on the table!)

What I am most concerned with is the time from the delivery of the "final" element_t to the declaration of a result. And, the resources consumed getting to that point (e.g., walking a tree requires accumulating state to know how to climb back *out* when you chase a failed branch!).

Assume this runs on a 32b architecture (e.g., pointers cost you

32b). And, element_t's will definitely be larger than char's (I think I can limit them to 32b -- worst case, *element_t's at the expense of an indirection)

As the "dynamic" selections are accumulated at run-time, the costs of any algorithm that imposes a specific structure on them should be taken into consideration (esp because writable store is more expensive -- and in shorter supply -- than R/O store!). E.g., merging them into a sorted list, tree/trie, "compiled" representation, etc.

For folks who shudder at abstractions, pretending element_t's are char's allows you to consider the problem as one of "comparing strings" -- against a fixed dictionary *and* a dynamically maintained dictionary. Except, of course, the cost of compares is considerably higher!

[This is the approach I am currently taking as I chase down applicable algorithms. I've made a first pass through Knuth and have now started looking a bit further afield for ideas (e.g., compiled regex's)]

Thx,

--don

Vote

S

srl100 11 years ago

Hack out a simple solution (e.g. gather elements in phase 1, do all comparisons in phase 2).
Optimise it if you have to (e.g. when you learn more about the real requirements and bottlenecks).

Any ideas from your co-workers?

--------------------------------------- Posted through

formatting link

Vote

D

Don Y 11 years ago

Actually, spent the evening with some colleagues going over some of these issues (lots of emails flying back and forth, code fragments, sketches, etc).

Accumulating element_t's (in process 1 -- or 1.5) and then processing them doesn't buy you anything (if you were looking for a hash on the completed string, it might). Given that you don't know how long a particular sequence will be, the ideal solution is to process and "discard" each element_t as soon as you can -- so you only have to size the algorithm to process a single element_t, incrementally.

We came up with an algorithm that looks like it will work. And, be "templatizable" (ick!) -- so I can apply it in a few different applications.

I'll try and code it up later in the week after stewing on it for a while. In the near term, I have a couple of birthday parties to bake for (and this rain significantly affects my plans for what I was going to prepare :< )

Vote

C

Clifford Heath 11 years ago

Don,

A trie is the ideal structure for this problem. Each new element_t either fails the search for a match, or descends to a trie that contains possible matches. If you have a very large number of possible patterns to match, with many common prefixes, the trie can be encoded as a PATRICIA tree - also described by Knuth. Probably not worth it though, given your description.

Clifford Heath.

Vote

D

Don Y 11 years ago

Yes, I was brainstorming this with friends, last night. We decided the "stumbling block" was *assuming* that the "const" templates had to be encoded the same way as the "dynamic" ones.

ROM is (relatively) cheap and "plentiful". RAM, neither.

So, opt for the more "expensive" (all else being equal) trie approach for the canned templates. Then, something that makes the *opposite* tradeoff (space over time) for the "dynamic" ones. This allows the cost (MIPS) of building the trie for the const ones to be born at compile time and lets the cost (time) of maintaining (adding new entries) the dynamic ones be smaller/simpler at run time.

And, with care, both approaches can be crafted to work incrementally -- i.e., let your location in the trie effectively represent ALL of the element_t's that were processed to get you there!

I'm hopeful. But, currently distracted so I can't invest the time (yet) to see how well it performs.

Thx!

Vote

Sequence matching

Join the Discussion

Didn't find your answer?