Command language parsing - how formal to get?

- C
- Chris Carlen
  
  Contact options for registered users
posted
16 years ago

Fri, Aug 10, 2007 5:12 PM

Hi:

Having completed enough serial driver code for a TMS320F2812 microcontroller to talk to a terminal, I am now trying different approaches to command interpretation.

I have a very simple command set consisting of several single letter commands which take no arguments. A few additional single letter commands take arguments:

Command language (somewhat generalized):

command argument(s) description

a do action_a b do action_b c do action_c ... n do action_n ... y [var_name] display variable(s) z var_name value set variable to value

where var_name is a variable identifier. The 'd' command lets one display the value of a variable identified by var_name. If no var_name is provide, this command will display all variables. The 'e' command lets one set the value of the variable identified by var_name to value.

Presently I am using the simplest yet least flexible approach to parse commands. I scan and lexically analyze on the fly, and use if, else if, else logic blocks to parse. (I know, pathetic.) This approach requires one if, else if, else block for each field of the command language.

Next I plan to split the lexical analysis to a first-pass, and create a list of tokens, with additional processing such as converting keywords and variable names to keys, and numeric text fields to machine numbers. This will eliminate the scanning/lexing on the fly to make the parsing code much more readable.

Then a second pass of parsing could still be done using if, else if, else it seems in the case that a language has a fixed and relatively small maximum number of fields per command.

The question is then, for a language similar to the above and considering these factors:

need to implement in an embedded microcontroller in not more than a few K of program memory
not a particularly high speed of parsing is required. Ccommands will be issued intermittently over a slow interface of perhaps 115.2kbps.
ease of understanding and adapting the code to changes in the command set is very valuable. Most likely additions would be a command to tell the thing to accept a binary data stream representing a file/table to be stored in the uC RAM.

Is it common to hardcode parsing with if, else if, else blocks or at what point does one decide to use more generalized techniques such as formalizing the language description in BNR form, and using lexer/parser generation tools such as lex/yacc ?

I have always found the topic of parsers/interpreters/compilers fascinating, and would like to spend considerable time learning more about this. Yet I also have to "just get the thing to work." I feel like I will ultimately need to perform quite a few experiments with different parsing techniques to get a feel for the best method to choose for this and future applications.

Thanks for input.

--
Good day!

________________________________________
Christopher R. Carlen
Principal Laser&Electronics Technologist
Sandia National Laboratories CA USA
crcarleRemoveThis@BOGUSsandia.gov
NOTE, delete texts: "RemoveThis" and
"BOGUS" from email address to reply.

- J
- Jonathan Kirwan
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Aug 10, 2007 6:36 PM

Hi, Chris. Just a few comments.

(1) I wouldn't design the commands and queries as you did. I like the recommendations of IEEE-488-2. One of them is to use a ? for queries and to reply to such queries with a command that can also be parsed.

For example, in your above list, you list 'y' and 'z' for reading and setting variable values. The recommended procedure would be to use the same text for both actions, but in this way:

var? foo

which would cause the following reply:

var foo 5.3

and you'd say:

var foo 7

to change it to 7.

The query form is just a ? added to the end of a command form, in other words.

It also recommends that you implement a single, simple command that dumps out the current state. That state should be listed out in such a way that if the receiver of it simply sent the entire body straight back to your device, it would have the effect of restoring the state. So if you had three variables that represented all of the important state in your device, call them S1, S2, and S3, then someone might issue the following query:

state?

and get the following response:

var S1 4.1 var S2 6.5 var S3 -1.2

and that if the receiver then fired those straight back in, the state (no matter what it was before) would be restored to those values.

(2) Here's an option to consider:

for ( ; ; ) { auto void (*func)( const char * cmd ); auto int status; static char commandname[MAXCOMMANDNAME]; skipwhitespace( ); parsecmd( commandname, MAXCOMMANDNAME ); if ( (func= commandsearch( commandname )) != 0 ) status= (*func)( command ); else status= NOSUCHCOMMAND; if ( status != SUCCESS ) { skipremainderofline( ); /* print some kind of parsing error here */ } }

Something alone these lines. The 'parsecmd' function might use the if..else structure you already have. I tend to use a table that lists the text of each command/query along with a function pointer to the function to call for the rest of the parsing and execution. There have been times in my life when, for legacy reasons, I've wanted to have several command/query names for the same effect, for example, and this makes that pretty easy to do.

Anyway, just a thought.

(3) I don't know if you intend to do other things, while parsing. In most stuff I do, it's essential that I don't get stuck, nested down in some parsing code, when there is other work that also needs monitoring. For example, I may need to take action on the basis of a push button that might also be pressed at about the same time I'm down parsing serial port text. And I can't afford to sit around waiting for the next character of a command, with the code buried in the middle of some if..else, while there is a keypress that is also demanding attention right now.

So I almost always include something along the lines of an operating system, here. In many cases, this might be nothing more than just implementing cooperating task switches with a switchaway() function. In other words, if I'm in the middle of the serial port code, buried down inside an if..else, and I'm now calling the serial port code to "get the next character" and it responds that there are no more characters just yet... then I call this switchaway() function. When it returns, I'm allowed some more time so I re-call the serial port code for that 'next character' thing. If it is still empty, I recall the switchaway() function, again. And so on. That way, I don't lose my place in the middle of my parsing code but at the same time I can pass along the cpu to other functions so they can get their work done.

It doesn't take more than a few lines of assembly code to implement such a function, by the way. Most compilers set aside only a few registers that need to be saved across a function call like that, so it only needs to push a few registers, switch to a new stack pointer, and then pop the same registers and return. That's it. So it isn't hard and it helps keep your code clean. Otherwise, you'd have to somehow save the state of your parsing function, return from it so other work can get done, then somehow rewind yourself back into some nasty position the next time you get called at the top of it. Or just sit there and ignore everything else while you wait for the next serial port character to arrive.

Best of luck, Jon

- R
- Robert Adsett
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Aug 10, 2007 10:43 PM

For a simple human (as opposed to machine) interface I've often used a simple lookup table composed of command name/function pointer pairs. Simple brute force lookup, and you can add a simple help string for each command. The functions themselves were responsible for parsing any arguments.

Simple but limited.

Robert

--
Posted via a free Usenet account from http://www.teranews.com

- C
- Chris Carlen
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Aug 10, 2007 10:48 PM

Thanks for the input.

Interesting suggestion. I suppose since I have 0 experience with designing command languages, most suggestinos will seem interesting :-)

Actually, I have taken inspiration for this command set from DOS debug. I'm also writing a hex line editor in Python (as a Python learning exercise) which also uses this single letter business.

I just downloaded IEEE-488.2 Good grief! I have often shyed away from standards since they are always long and agonizing to read. Would take months just to get familiar with this. So I have no interest in making the gadget fully 488 compliant. It doesn't appear that you were implying that I should.

I had thought about becoming familiar with SCPI to see if it would be applicable to this project as well. It just all seems way too complicated.

This is pretty cool.

commandname will get filled with a command keyword by parsecmd()?

parsecmd() is basically a lexer which extracts the next word ?

Lookup and execute a function to perform the action of that command.

It will take considerable thought. I sort of get what your code does. Just a little vague. I think I can come up with lots of ideas here, though.

I have two critical threads on my gadget, and a non-critical thread which is the serial interface. The top #1 priority is a non-preemptable, timer interrupt driven periodic thread. It is guaranteed to finish in some finite maximum time, but it's period can vary according to parameters which can be influenced by the user interface. Basically, it's a waveform synthesizer with variable clocking period. It's got a max clocking freq. of about

Next is #2, a preemptable (by the top thread) periodic thread which is also timer interrupt driven. This second one has a brief assembly intro which reenables interrupts ASAP so that it can't block #1 for more than a few cycles. This is the hardware user interface (panel buttons) polling thread. It can tolerate being chopped up by #1. It also has a finite maximum exec time.

Thus, the serial code can block for as long as it likes. The only thing that matters is that it doesn't block itself long enough to fail to clear out the serial receive FIFO before the FIFO overflows.

I plan to use the maximum exec times for all the stuff and their periodicities to compute the maximum baud rate at which the device can receive while being certain never to miss characters. At least I think this can be determined.

So that's my OS. With only a few threads, it's easy to just let timers trigger the important ones, and round-robin the rest.

--
Good day!

________________________________________
Christopher R. Carlen
Principal Laser&Electronics Technologist
Sandia National Laboratories CA USA
crcarleRemoveThis@BOGUSsandia.gov
NOTE, delete texts: "RemoveThis" and
"BOGUS" from email address to reply.

- G
- Gene
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Fri, Aug 10, 2007 11:15 PM

Don't apologize! Table-driven methods are nearly always easier to understand and maintain than equivalent "hard code." And they avoid redundancy (such as the comparisons in an elsif chain) that you can't afford in a scarce memory environment. Thus all the old assemblers and BIOS interfaces are implemented that way. Even if you choose a more sophisticated command parsing method than table search, table- oriented implementations of a DFAs and parsers are among the most compact, and yet they are quite fast.

- S
- Steve at fivetrees
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 12:38 AM

Instead of a function pointer, think in terms of an enum of commands, a structure (class, if you prefer) defining all the properties of the command, and a table of such (constant) structures. One structure element could indeed be a handler function, called with an enum reason value, decoded via a switch within the handler (query vs command etc). Others could define the number of operands and their types (handlers) etc...

As for keeping state while handling other tasks: JK made it sound complex ;). A state machine makes it trivial to pick up the position when a character comes in, and parse the complete line on CR/LF (etc).

Steve

formatting link

- J
- Jonathan Kirwan
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 1:08 AM

On this point, I agree that state machines aren't hard to do when you have a complete specification. They can be rather hard to document inside the code, though.

As an example where I am really glad to use state machines would be for driving things where I have clock edges to drive in software with special timing of the signals relative to the clock edges. Especially in cases like this because the hardware is usually well specified and not likely to change in the future -- except in very predictable ways (increased addressing width in a serial protocol, for example.)

In parsing cases, it is much more natural to use a recursive descent approach and the co-routine/cooperative switch mechanism fits the paradigm much better in my opinion. You often go around changing parsing in ways that would make maintaining a state machine a bit tricky. This is one reason why there are compiler-compilers, in fact, to automate the construction of state machines for parsing. If you are doing the parsing and execution coding by hand, though, and not with some tool to automate the generation of a state machine, I would rather use a cooperative hand-off mechanism to keep the parsing code simple and straightforward.

Jon

- M
- Mark Borgerson
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 1:36 AM

I find it useful to buffer the input line, then activate the parser when the CR/LF is received. This allows you to support the backspace key for users who are not perfect typists.

I used a structure to hold the command characters (I used 2-letter commands), a pointer to the command function, and the number and type of parameters.

Mark Borgerson

- C
- CC
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 2:12 AM

Yeah. I have made a function:

int SCI_gets(char *str, int count);

which behaves much like fgets(), except that a global enum:

typedef enum {NO_ECHO=0, ECHO=1} SCIgetsEchoModes; static SCIgetsEchoModes Gets_echo_mode = ECHO;

control whether it echos. It also handles backspace, and interprets any of '\n', '\r', or '\0' as EOL.

It will return a SCI read error passed up from the underlying SCI_getc(), or an EOF if nothing was available when first called. It blocks however, once one char has been read. It also echos non-printing characters as '?' with red coloring on ANSI compliant terminals. Just plain '?' if a #define is set to NON_ANSI.

So I plan to parse only buffered, terminated strings.

--
_____________________
Christopher R. Carlen
crobc@bogus-remove-me.sbcglobal.net
SuSE 9.1 Linux 2.6.5

- R
- Robert Adsett
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 3:11 AM

I'm sorry I didn't mean to apologize :)

Really, I was just mentioning a simple, quick method. I was under the impression Chris was really looking for something with a little more sophisticated syntax checking than a strcmp loop.

Speaking of which a friend of mine implemented a simple command interpreter on a PDP (actually a clone I think) in fortran that used two characters in the command. The characters were in a 6 bit encoding and two of them formed a 12 bit word that was used in a command lookup. Apparently you can get quite fast with two character commands if you don't have to worry about things like whitespace and other syntactic fluff.

Robert

--
Posted via a free Usenet account from http://www.teranews.com

- J
- John Larkin
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 4:54 AM

It's cleaner to do a table-driven thing, rather than a bunch of case statements. With a single letter per command, you can just index to 26 small table entries, each with a handler address, maybe a pointer to a variable, and maybe a flag word.

I recently did a parser with command lines like...

ADelay 12.5n; TRigger REmote; FIre

where each command is a 2-char token (more chars are allowed but ignored), a token plus an arg, or two tokens. All table driven, with maybe 140 commands. The parser proper is about 100 lines of assembly code. It just packs the command chars into a longword as "AD " or "TRRE" or "FI ", looks it up in the table, and dispatches to the named handler.

XTABLE:

; TOKEN1 TOKEN2 HANDLER PARAM

.WORD "US", 0, XLONG, FUSE ; USec timer .WORD "UW", 0, QWORD, FUSE+2 ; USec timer, as a word! .WORD "WA", 0, XWAIT, 0 ; WAit nnn microseconds .WORD "FI", 0, XFIRE, FIRE ; FIre .WORD "GA", "FI", XGFIR, FBFIRE ; GAte FIre .WORD "FE", 0, XFEOD, FEOD ; FEod .WORD "CO", 0, XOK, 0 ; COmment!

.WORD "AD", 0, XTSET, CCBA+CDHI ; ADelay 123.45N .WORD "AW", 0, XTSET, CCBA+CWHI ; AWidth 123.45N .WORD "AS", "ON", XSON, PCBA ; ASet ON .WORD "AS", "OF", XSOF, PCBA ; ASet OFf .WORD "AS", "PO", XSPOS, PCBA ; ASet POs .WORD "AS", "NE", XSNEG, PCBA ; ASet NEg .WORD "AS", 0, XAINQ, CCBA ; ASet inquiry .WORD "AP", 0, XAINQ, PCBA ; APend pending inquiry

John

- S
- Stefan Reuther
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 10:03 AM

The parsing style I choose also depends on how robust (against users that mistype commands) it has to be. For a general-purpose, shell-style command language, I usually split the input into an argv-style array and hand that to an interpreter function. This allows you to implement a uniform way of quoting arguments, and use the usual idioms and standard functions (such as getopt()) to parse it.

For things that have irregular syntax, I often use sscanf for parsing: if (sscanf(input, "#ifdef %31s", &variable) == 1) handleIfdef(); or if (sscanf(option, "--size=%i%c", &size, &dummy) == 1) ... If you already use sscanf in your program, this may be a nice low-overhead way of parsing, otherwise it's probably quite expensive. It's a little harder to get sensible error messages from the scanf solution (mine would always reply 'unknown option' if you say '--size=X', not '--size wants a number as argument'), but for a debugging shell or internal tools I don't care about that.

For looking up commands, I sometimes use tables, sometimes if/else-if cascades, depending on the mood I'm in. Tables have the advantage that you can place some meta-information in them (like a help message for each command), and play other nice tricks (like tab completion or expansion of abbreviated commands). They have the disadvantage that they need more bureaucracy (either one function per command, or an enum plus a switch of the same size the if/else-if would have).

I don't think lex/yacc give you any advantage for simple shell-style languages. They can get interesting if you have complicated structured languages, such as evaluation of full C expressions, but even that can quite easily and quickly be done with a hand-crafted recursive-descent parser.

Stefan

- A
- Andrew Smallshaw
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 6:20 PM

A slightly more sophisticated approach is to ensure that the lookup table is sorted and apply a binary search on the table. On each iteration test the mid-point of the range between a known too low and too high value (these begin set to 0 and the last entry in the table). I've used this approach in the past and in C it amounts to maybe 3 or 4 additional lines of code. ISTR one of the examples in K&R uses this approach somewhere if you want a more concrete example.

--
Andrew Smallshaw
andrews@sdf.lonestar.org

- A
- Ark Khasin
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sat, Aug 11, 2007 8:00 PM

Lookup table would be by far the simplest and fastest. For maintainability, you can exploit designated initializers for your constant lookup table (the [8]=2007 stuff) if you have C99. If you worry about the size of the table, in many cases it can be compressed, but for that you'd need an external preprocessor or another code (actually, data) generator. That, along with compression, will automate the maintenance.

You may want to take a look at Unimal

formatting link

and in particular App Note

4 there. Sorry for the shameless self-promotion but yours is the sort of a problem that brought Unimal into existence.

-- Ark

- P
- Paul Keinanen
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sun, Aug 12, 2007 3:09 AM

Unless you are using some Zuse style relay computers, I really do not see the point of optimizing the command line _human_ interface for speed. The user would not notice the difference anyway.

Paul

- C
- CC
  
  Contact options for registered users
Vote on answer
posted
16 years ago

Sun, Aug 12, 2007 2:34 PM

In this case, the interface will also be machine driven, though the serial link limits the speed such that I can't imagine any parser implementation that wouldn't keep up. So the simplest approach that can do the job yet be flexible for possible modifications will suffice. The fastest it ever might have to go is perhaps 1Mbps or so, if we move to a FTDI based USB connection.

The thing that's really hard to foresee at this point is how the command set might evolve.

Good day!

--
_____________________
Christopher R. Carlen
crobc@bogus-remove-me.sbcglobal.net
SuSE 9.1 Linux 2.6.5