'Favourite' methods for scanning/parsing input data?

- Mike G

Mon, Aug 21, 2006 1:47 PM

Hi,

I was reading an interesting article on Embedded.com about using Lex & Yacc in embedded applications. Here's the URL (please don't think I'm being patronising to the regulars - I just thought including it may be useful)

formatting link

It just started me wondering - are there any 'pet' methods that you'd like to share?

Regards, Mike

- Steve at fivetrees

Mon, Aug 21, 2006 2:05 PM

I use a loop, reading characters with all error handling (including timeouts) in one place, and a state machine that effectively maps the syntax structure of the data being read, character by character.
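A minimal sketch of that kind of character-by-character state machine, in C. The frame layout here (a start byte, a length byte, the payload, then an 8-bit additive checksum) and all names such as `rx_feed` and `SOF_BYTE` are assumptions made up for illustration; they are not taken from any protocol discussed in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SOF_BYTE    0x7Eu   /* hypothetical start-of-frame marker */
#define MAX_PAYLOAD 16u     /* hypothetical maximum payload size */

/* One state per syntactic element of the (assumed) frame. */
typedef enum { WAIT_SOF, WAIT_LEN, IN_PAYLOAD, WAIT_SUM } rx_state_t;

typedef struct {
    rx_state_t state;
    uint8_t    len;                  /* payload length announced by the sender */
    uint8_t    count;                /* payload bytes stored so far */
    uint8_t    sum;                  /* running checksum over length + payload */
    uint8_t    payload[MAX_PAYLOAD];
} rx_parser_t;

/* Return to hunting for a start byte. Also call this on a receive timeout. */
void rx_reset(rx_parser_t *p)
{
    p->state = WAIT_SOF;
    p->len   = 0;
    p->count = 0;
    p->sum   = 0;
}

/* Feed one received byte. Returns true only when a complete frame with a
 * matching checksum has just been finished; the payload is then in
 * p->payload[0 .. p->len - 1]. */
bool rx_feed(rx_parser_t *p, uint8_t byte)
{
    switch (p->state) {
    case WAIT_SOF:
        if (byte == SOF_BYTE) {
            p->state = WAIT_LEN;
        }
        break;

    case WAIT_LEN:
        if (byte == 0u || byte > MAX_PAYLOAD) {
            rx_reset(p);             /* impossible length: resynchronise */
            break;
        }
        p->len   = byte;
        p->count = 0;
        p->sum   = byte;
        p->state = IN_PAYLOAD;
        break;

    case IN_PAYLOAD:
        p->payload[p->count++] = byte;
        p->sum = (uint8_t)(p->sum + byte);
        if (p->count == p->len) {
            p->state = WAIT_SUM;
        }
        break;

    case WAIT_SUM:
        p->state = WAIT_SOF;         /* ready for the next frame either way */
        return byte == p->sum;
    }
    return false;
}
```

The receive loop would call `rx_feed()` for every byte that arrives and `rx_reset()` whenever its timeout fires, so a lost byte costs at most one frame rather than shifting every later field.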

As we've discussed here before, reading a block of data and casting it to a packet structure is a really bad idea. Yet it seems quite common.

Steve

- Jim Stewart

Mon, Aug 21, 2006 5:44 PM

String compare in assy.

- Pete Fenelon

Mon, Aug 21, 2006 6:42 PM

If I'm designing protocols myself for embedded, I prefer to make the command side of things fixed-length binary packets, with data (of length described in the command packet) following. Sometimes you can't get away with that though ;)

A halfway house between text and binary that's a useful "cheat" for small homebrewed protocols that still need to be vaguely readable is using short (1, 2 or 4 octet) fixed-length commands, then using their representation as 8, 16 or 32-bit integers as cases in a switch statement or entries in a hash table of functions to dispatch to...
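A rough sketch of that trick for two-octet commands, in C. The command names (`GT`, `ST`, `RS`), the `CMD2` macro and the `cmd_id_t` enum are invented for illustration only.

```c
#include <stdint.h>

/* Pack two ASCII characters into a 16-bit integer, first character in the
 * high byte, so the result does not depend on the CPU's endianness. */
#define CMD2(a, b) ((uint16_t)(((uint16_t)(uint8_t)(a) << 8) | (uint8_t)(b)))

/* Hypothetical command set. */
typedef enum { CMD_UNKNOWN, CMD_GET, CMD_SET, CMD_RESET } cmd_id_t;

/* buf must point at (at least) the two command octets. */
cmd_id_t decode_command(const uint8_t *buf)
{
    switch (CMD2(buf[0], buf[1])) {
    case CMD2('G', 'T'): return CMD_GET;
    case CMD2('S', 'T'): return CMD_SET;
    case CMD2('R', 'S'): return CMD_RESET;
    default:             return CMD_UNKNOWN;
    }
}
```

Building the integer from the individual octets, rather than casting the buffer to a `uint16_t *`, keeps the case labels readable on the wire ("GT", "ST") and sidesteps alignment and byte-order surprises.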

In a PPOE we decided against lex for text-based parsers and rolled our own very tight lexer (there were only really four types of token it needed to recognise: keyword, number, identifier and quoted string) and interfaced that to a Bison-generated parser. The same lexer has been reused across many different projects, and did a decent enough job for us!

pete

--
pete@fenelon.com "I once coaxed a dog into a library" - Tommy Saxondale

- Anton Erasmus

Mon, Aug 21, 2006 7:44 PM

The code generated by Lex & Yacc requires a lot of RAM, which is a problem on most MCUs. Even for quite simple things it easily requires more than 64K of RAM. Parsifal Soft had a program called Anagram that could generate a state-machine-based parser that could easily run on small MCUs. Unfortunately the owner died, and it has been impossible to get hold of a copy ever since. A very good example of why open source is a good idea. Does anybody have a copy of Anagram they would be willing to sell?

Regards Anton Erasmus

- BobH

Tue, Aug 22, 2006 1:12 AM

Other than all of the compiler portability issues, why don't you like casting to a packet structure after the full packet has been received?

Curious, Bob

- Steve at fivetrees

Tue, Aug 22, 2006 1:51 AM

I'm glad you asked ;).

Let's assume the comms line (ignore TCP/IP for the moment) is slightly flakey, and one byte gets lost (perhaps a parity error) along the way. Consider what the effect would be on a packet digester. Also, if we were expecting a packet of a certain size, perhaps we won't see the full complement. What would happen then? Timeout? Or maybe use the first byte from the next packet?

With TCP/IP, it's slightly easier. But even then, suppose we receive a malformed packet. Consider the implications.

In all of these cases, we've probably cast garbage onto the structure. That, at best, means we have to validate each element of the structure. Which means we're back to ensuring the syntax was enforced at the protocol level. Which is where I came in.

And then there's the portability (endianness etc) you mentioned...

Steve

- BobH

Tue, Aug 22, 2006 3:39 AM

I am not all that thrilled with the casting-to-a-structure method, but I have used it. My main objections are the portability issues. As I recall from the C standard, structure packing and padding are left up to the compiler writer (member order is fixed, but bit-field layout is not). Then you get object size issues: as fond as everybody is of 8-bit accessibility, some processors (DSPs) have a 16-bit minimum access size. Endianness is yet another issue. Most of these issues can be sorted out with either compiler flags or ifdefs, but it is painful to change compiler vendors or, worse yet, processors.

The packet validity can't be assumed unless your protocol has a checksum or better validation, regardless of how the data is taken out of the input data. The timeout and header/trailer fields on the packet help, but in a binary protocol, unless you do bit stuffing, it is possible for the header values to appear in the data stream.

The structure method lets the compiler deal with the offset calculations (for better or worse) instead of manually walking a pointer through the data and assigning it to variables. The structure method is probably smaller code-wise than manually walking a pointer through.
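For comparison, here is a sketch of the manual alternative: walking through the received bytes and assembling each field explicitly. The message layout (one type octet, a big-endian 16-bit id, a big-endian 32-bit value) and the names `sample_msg_t` and `sample_msg_decode` are assumptions for illustration. It also assumes a target where `uint8_t` exists, which is not the case on the 16-bit-minimum DSPs mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* In-memory form of a hypothetical message. The compiler is free to pad
 * this however it likes, because it is never overlaid on the wire bytes. */
typedef struct {
    uint8_t  type;
    uint16_t id;
    uint32_t value;
} sample_msg_t;

/* Size on the wire: 1 + 2 + 4 octets, with no padding. */
#define SAMPLE_MSG_WIRE_SIZE 7u

/* Assemble big-endian fields octet by octet, independent of CPU byte order. */
static uint16_t get_be16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns false if the buffer is too short to hold a whole message. */
bool sample_msg_decode(const uint8_t *buf, size_t len, sample_msg_t *out)
{
    if (len < SAMPLE_MSG_WIRE_SIZE) {
        return false;
    }
    out->type  = buf[0];
    out->id    = get_be16(&buf[1]);
    out->value = get_be32(&buf[3]);
    return true;
}
```

The offsets are written out by hand, which is the cost being weighed here, but the result does not change with packing flags, alignment rules or endianness. Range checks on the decoded fields would still be needed either way.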

My thinking is that if bad things happen as a result of bad values, sanity check them, regardless of how they are parsed.

Thanks for your thoughts, Bob

- Paul Keinanen

Tue, Aug 22, 2006 7:35 AM

As long as the limits are fixed at compile time, the limit checking of a floating point value does not cost much, even with 8 bit integer instructions only. No floating point subtraction (which involves costly denormalisation) is required.

To check if a received floating point value is above a limit, just compare the exponent part of the value with the exponent part of the limit. If the value exponent is greater, the whole value is greater than the limit. If the value exponent is less than the limit exponent, the value is definitely less than the limit. Only when the value exponent and the limit exponent are the same is there a need to compare the mantissa parts. Starting with the most significant part, compare bytes/words until a difference is found.
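The comparison described above could be sketched in C roughly as follows. This assumes `float` is a 32-bit IEEE 754 single (1 sign bit, 8 exponent bits, 23 mantissa bits) and that the limit is positive; the function names are made up. On a real 8-bit target the limit's exponent and mantissa would be compile-time constants rather than being extracted at run time.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy the bit pattern of a float into an integer.
 * Assumption: sizeof(float) == 4 and the format is IEEE 754 binary32. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Returns true if |value| is greater than limit (limit assumed positive). */
bool magnitude_exceeds(float value, float limit)
{
    uint32_t v = float_bits(value) & 0x7FFFFFFFu;   /* drop the sign bit */
    uint32_t l = float_bits(limit) & 0x7FFFFFFFu;

    uint8_t v_exp = (uint8_t)(v >> 23);             /* 8-bit biased exponent */
    uint8_t l_exp = (uint8_t)(l >> 23);

    if (v_exp > l_exp) {
        return true;      /* larger exponent: above the limit */
    }
    if (v_exp < l_exp) {
        return false;     /* smaller exponent: below the limit */
    }
    /* Same exponent: only now do the mantissas have to be compared. */
    return (v & 0x007FFFFFu) > (l & 0x007FFFFFu);
}
```

Infinities and NaNs carry the maximum exponent (255), so against any finite limit they are reported as exceeding it by the exponent test alone.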

Paul

- Mark Borgerson

Tue, Aug 22, 2006 5:21 PM

I think that would work pretty well in the most common cases that I encounter: A gyro rate that should be inside +/-300 deg/sec ends up at 85314---or some other very large number.

The problem with your test algorithm where error density is low is that it requires the most cycles when values are inside limits.

I may have to try your approach, but limit it to just the exponent test, then see what percentage of the actual errors it catches. Since the numbers go through a digital filter, error values which are less than 2X the limiting value won't have as much disruptive effect as numbers that are 2000 times the limit.

Since my particular limits are symmetric about zero, I may even be able to work a bit of magic with the sign bit and perform only one test.

(magic = as-yet-undefined shifts and masking operations)
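One possible form of that "magic", offered as a sketch rather than as what was eventually used: for IEEE 754 singles, the bit patterns of non-negative values sort in the same order as the values themselves when read as unsigned integers. Clearing the sign bit and doing a single 32-bit unsigned compare against the bit pattern of the positive limit therefore tests both ends of a symmetric range at once. The names below are hypothetical, and the code assumes a 32-bit IEEE 754 `float` and a positive, finite limit.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Bit pattern of a float. Assumption: 32-bit IEEE 754 binary32. */
static uint32_t f32_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Returns true if -limit <= value <= +limit, using one integer compare.
 * limit is assumed positive and finite. */
bool within_symmetric_limit(float value, float limit)
{
    return (f32_bits(value) & 0x7FFFFFFFu) <= f32_bits(limit);
}
```

With a compile-time limit, `f32_bits(limit)` collapses to a constant, leaving one mask and one compare per sample. NaNs and infinities have masked bit patterns above every finite value, so they fail the test.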

Mark Borgerson

- Paul Keinanen

Tue, Aug 22, 2006 8:05 PM

This is definitely true if the valid range is very limited, e.g. [23.0 .. 29.9], in which case a single sign test, a single exponent compare and two (at least partial) mantissa compares are required.

However, with a larger valid range, e.g. [1.5 .. 100.0], any sample in the [2.0 .. 64.0] range will require only (two sign tests and) two exponent compares. If the sample value is in the [1.5 .. 2.0] range or the [64.0 .. 100.0] range, _one_ additional (usually partial) mantissa compare is required. With a uniform sample distribution, about 2/3 of the cases in this example could be handled with just the exponent compares. Of the remaining 1/3 of the cases, most would be handled with a single integer compare, unless the value is very close to the limit.

With an IEEE 32-bit float, just clear the sign bit (leftmost bit). On an 8 bit processor, get the next 8 bits (exponent) and compare them to the limit exponent. If the exponents are equal and an accurate limit test is required, compare the three rightmost bytes in the sample and in the limit. The comparison can be terminated as soon as there is a difference.

On a 16 bit processor, get the leftmost 16 bits, mask off the sign bit and mantissa bits and compare the whole 16 bit value with the limit value (with mantissa bits masked off). If the masked sample value exponent is less, the sample is OK; if greater, the value is invalid.

If the exponents are equal and a more accurate check is required, get the original leftmost 16 bits, mask off the sign bit, and compare them with the leftmost 16 bits of the limit. There is no need to mask off the exponent, since it is the same in both values and any possible difference will only occur in the mantissa bits. Only if this comparison produces an equal result must the rightmost 16 bit word also be compared.
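The 16-bit procedure might look something like this in C. It is a sketch under assumptions: the sample and the limit are IEEE 754 singles already split into a high word (sign, 8 exponent bits, top 7 mantissa bits) and a low word (remaining 16 mantissa bits), the limit is positive, and the name `over_limit16` is invented here.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXP_MASK16  0x7F80u   /* exponent bits within the high word */
#define MAG_MASK16  0x7FFFu   /* everything in the high word except the sign */

/* Returns true if |sample| > limit, using only 16-bit operations.
 * s_hi/s_lo and l_hi/l_lo are the high and low words of sample and limit. */
bool over_limit16(uint16_t s_hi, uint16_t s_lo, uint16_t l_hi, uint16_t l_lo)
{
    uint16_t s_exp = (uint16_t)(s_hi & EXP_MASK16);
    uint16_t l_exp = (uint16_t)(l_hi & EXP_MASK16);

    /* Step 1: exponent-only compare settles most samples. */
    if (s_exp < l_exp) {
        return false;
    }
    if (s_exp > l_exp) {
        return true;
    }

    /* Step 2: exponents are equal, so compare the whole high word minus the
     * sign; any difference can only come from the top mantissa bits. */
    s_hi = (uint16_t)(s_hi & MAG_MASK16);
    l_hi = (uint16_t)(l_hi & MAG_MASK16);
    if (s_hi != l_hi) {
        return s_hi > l_hi;
    }

    /* Step 3: high words are equal, so the low mantissa word decides. */
    return s_lo > l_lo;
}
```

As an example of the inputs, 300.0 as a single is 0x43960000, so a limit of 300.0 would be passed as `l_hi = 0x4396`, `l_lo = 0x0000`.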

Paul

- Steve at fivetrees

Tue, Aug 22, 2006 8:24 PM

Erm... I presume these FP values are within a checksummed or CRC'ed packet?

Steve