Parsers for extensible grammars?


[I probably should direct this at George... :> ]

I'm writing a command parser. Some set of commands are "common". But, other instances of the parser are augmented with additional commands/syntax.

[These additions are known at compile time, not "dynamic"]

Ideally, I want a solution that allows folks developing those "other" commands to just "bolt onto" what I have done. E.g., creating a single "grammar definition" (lex/yacc) is probably not a good way to go (IME, most small parsers tend to be ad hoc).

[Note: I can't even guarantee that the extensions will be consistent or "harmonious" with the grammar that I implement]

A naive approach (?) might be for my code to take a crack at a particular "statement"/command and, in case of FAIL, invoke some bolt-on parser to give it a chance to make sense out of the input. If *it* FAILs, then the input is invalid, etc.

This sounds like an incredible kluge. Any better suggestions?
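For what it's worth, the "take a crack, then FAIL over to the next parser" chain stays tidy if every parser is just a function behind one common signature. A minimal C sketch (all names invented for illustration; the real parsers would obviously do more than a prefix compare):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Each parser gets a crack at the input and reports whether it
 * recognized (and handled) the command. */
typedef bool (*cmd_parser_fn)(const char *input);

static bool common_parser(const char *input) {
    /* stand-in for the real "common command" parser */
    return strncmp(input, "help", 4) == 0;
}

static bool bolt_on_parser(const char *input) {
    /* stand-in for an extension's bolt-on parser */
    return strncmp(input, "baud", 4) == 0;
}

/* Try each registered parser in order; true if anyone accepted. */
static bool dispatch(const char *input,
                     const cmd_parser_fn *parsers, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (parsers[i](input))
            return true;
    return false;   /* nobody recognized it: invalid input */
}
```

Extensions then "bolt on" by appending a function pointer to the array, without touching the common parser at all.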

Reply to
Don Y

By that, I mean: having folks augment a lex/yacc grammar definition that I've developed as the basis of these "common" commands is probably ill-advised. *I* can choose such an implementation for the common commands but should probably not impose that solution on the extensions (and risk folks breaking the common stuff!)

Reply to
Don Y

If it's always something simple like

and the only thing that's changing is the command and (possibly) the argument types (text, numerical, etc.) that are bound to it, then a parser should be pretty easy -- I wouldn't even bother with Lex and Yacc.

Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

You want composable grammars. This is right in a core area of my expertise. Lex/Yacc (and all other forms of bottom-up LR parsing) are the wrong way to go, because the tools must do deep analysis of the entire grammar before producing a parser, and the errors that occur require significant expertise to understand.

Top-down parsers (also known as LL or recursive descent) are much easier to write and debug. They're typically implemented using limited look-ahead (often one token, i.e. LL(1)) and are slightly less powerful than LR parsers with the same look-ahead.

When an LL parser has unlimited lookahead it's very powerful, but it can have horrible performance due to the extensive backtracking required.

However, so-called packrat parsers trade memory for performance, while retaining almost all the power of unlimited backtracking. For a grammar of N rules, the memory usage for an input of size M is O(N*M), that is, linear in the size of the input. Because the CPU performance is so good, there's no need for a separate lexer -- the parse performance matches that of the DFAs used for lexing, so there's no advantage to be had.

Packrat parsers are generally created from PEGs (Parsing Expression Grammars). There are plenty of PEG parser generators; I maintain the most widely used one for Ruby, but they exist for dozens of languages.

The really nice thing about LL parsers is they're composable. You can take two grammars with different rules, and simply layer one on top of the other. Often it's helpful to allow a rule in one grammar to call the same rule (by name) in the other grammar - this works very like O-O inheritance.
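One way to picture that layering in plain C (a sketch with invented names, not any particular PEG tool's API): keep rules in a named table, and let an extension grammar shadow -- and call back into -- the base grammar's rules, much like method override and "super" in O-O inheritance:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A rule consumes input (advancing *input) and reports success. */
typedef bool (*rule_fn)(const char **input);

struct rule    { const char *name; rule_fn fn; };
struct grammar {
    const struct rule    *rules;
    size_t                n;
    const struct grammar *base;   /* grammar we are layered on top of */
};

/* Look a rule up by name, extension first, then the base grammar. */
static rule_fn find_rule(const struct grammar *g, const char *name) {
    for (; g != NULL; g = g->base)
        for (size_t i = 0; i < g->n; i++)
            if (strcmp(g->rules[i].name, name) == 0)
                return g->rules[i].fn;
    return NULL;
}

/* Base grammar's "number": one or more digits. */
static bool base_number(const char **in) {
    const char *p = *in;
    if (*p < '0' || *p > '9') return false;
    while (*p >= '0' && *p <= '9') p++;
    *in = p;
    return true;
}

/* Extension's "number": optional '-' sign, then defer to the base
 * rule -- the "call the same rule in the other grammar" idea. */
static bool ext_number(const char **in) {
    if (**in == '-') (*in)++;
    return base_number(in);
}
```

A real PEG toolchain does this wiring for you; the point of the sketch is only that name-based lookup with fallback is all the mechanism composition needs.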

Note that none of this will help you if you want "minimum abbreviation" support for your keywords. That is best implemented using a trie. It's all compatible with the use of a PEG parser though; the keyword matcher won't be written using a PEG, is all.
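On the "minimum abbreviation" point: a trie is the clean answer, but for a small keyword table a unique-prefix scan gives the same observable behavior. A hedged C sketch (linear scan; a trie does the same job in one pass over the input):

```c
#include <stddef.h>
#include <string.h>

/* Resolve a (possibly abbreviated) keyword by prefix match.
 * Returns the index of the unique match, -1 if nothing matches,
 * -2 if the abbreviation is ambiguous.  An exact match always
 * wins, so "down" is never ambiguous with "download". */
static int match_abbrev(const char *input,
                        const char *const *keywords, size_t n) {
    size_t len = strlen(input);
    int found = -1;
    for (size_t i = 0; i < n; i++) {
        if (strncmp(input, keywords[i], len) != 0)
            continue;
        if (strlen(keywords[i]) == len)
            return (int)i;              /* exact match */
        if (found >= 0)
            return -2;                  /* second prefix hit: ambiguous */
        found = (int)i;
    }
    return found;
}
```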

Clifford Heath.

Reply to
Clifford Heath

The problem I'm stuck with is not knowing what is likely to be added. While I suspect most will be similarly trivial, I can also imagine some esoteric additions!

The "common" commands are trivial enough that, if I knew ALL would be of similar form, I would almost adopt an approach of sequentially attempting to parse a command with individual templates -- until one "fit". Extensions would just be tacked onto the end of a (short) series of conditionals:

    if (parse(string, template1))
        command1();
    else if (parse(string, template2))
        command2();
    else if ...

Performance isn't an issue (though footprint may be). This would allow extensions to *simply* focus on what they needed to do for *those* individual commands (without fear of extensions mucking up the common commands!)
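The parse(string, template) helper itself can be tiny if a template is just literal text plus a couple of wildcard codes. A sketch (only %d shown, and the name is mine; a real version would grow more verb types and deliver the captured values somewhere useful):

```c
#include <stdbool.h>
#include <stdlib.h>

/* Match 's' against 'tmpl'.  The template is literal text except
 * that "%d" matches a decimal integer, stored through 'out'.
 * The whole input must be consumed for the match to succeed. */
static bool parse(const char *s, const char *tmpl, long *out) {
    while (*tmpl) {
        if (tmpl[0] == '%' && tmpl[1] == 'd') {
            char *end;
            long v = strtol(s, &end, 10);
            if (end == s) return false;   /* no digits where expected */
            if (out) *out = v;
            s = end;
            tmpl += 2;
        } else {
            if (*s++ != *tmpl++) return false;  /* literal mismatch */
        }
    }
    return *s == '\0';
}
```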

I wouldn't run the risk of someone buggering a "shared grammar" definition (and failing to add appropriate regression tests to detect that).

But, if the extensions get more esoteric (than this trivial approach), this starts to look toy-ish; there's very little to *leverage*, here... :<

Reply to
Don Y

This sounds like an *ideal* approach! But...

I'm concerned that too many folks would find a "formal" grammar tool to be intimidating.

I'm thinking about the *hundreds* of different "configuration files" I've encountered, "argument handling", etc. and how *few* of these were implemented with any sort of formal tools. Rather, almost as a RULE, they all seemed to be jumbles of spaghetti code (though they "got the job done", more or less).

Cases where "space" was accepted as whitespace -- but "tab", not. Or, vice versa.

Cases where comments HAD to begin in column one and others where leading whitespace was ignored.

Or, where comments could not coexist with preceding "content" on the same line.

Or, where groups of statements had to be co-located on the same line (with some formal delimiter) instead of being allowed to span lines (couple this with a maximum line length constraint and you can see one potential issue!)

And, in each case, no compelling REASON for the restrictions (as if "it was too much work" to provide a cleaner/more consistent implementation).

I'm left with the impression that these folks either underestimated the complexity of the tasks they set out for themselves; or, were intimidated by more "structured" approaches to SUCH A COMMON PROBLEM!

[Or, the structured approaches were harder to comprehend/debug than the ad hoc spaghetti-code implementations!]

So, I figure decoupling my portion of the implementation (common commands) from these potential extensions might be A Wise Choice.

Reply to
Don Y

Or just parallel parsers.

Don said "bolt onto" and "harmonious", but it isn't clear to me whether what he wants is to provide some kind of macro facility in which new commands are fully expressible in *his* language, or whether he wants to allow development of a parallel language that remains interoperable with his but also is separate from it.

The first is [relatively] easy, the second not so much.

In either case, the additional parser should get the first crack at input with Don's own parser as a fallback.


You mean _much_ less powerful. AFAIK, LR has not been shown to be a proper superset of LL, but it is known to be a vastly larger space. The LR(1) subset provably contains LL(1) and also many LL(>1) languages.

That said, a large number of useful languages can be expressed in LL with computationally reasonable lookahead. And it is true in general that LL parsers are easier to understand and debug.

Not really. "Ungetting" tokens - even a large number of them - is not very costly as long as 1) trying a parse produces no side effects [i.e. the predication is purely syntactic], or 2) side effects are implemented "functionally" and can be simply thrown away.

Syntactic predication and quite a lot of useful semantic predication can be implemented using stack or region allocation strategies, but there are some hard spaghetti cases where it's nice to have GC available in your development language.

PEG (and GLR also) replicates the parser state to deal with ambiguous input by following the different paths independently. But the fact is that most language constructs are unambiguous and thus do not cause forking, and *unavoidable* ambiguity in the language tends to be resolved quickly when the ambiguous constructs actually are used.

The problem with PEG (and GLR also) grammars is that there still are no really useful tools for finding/debugging *unintended* ambiguity. The generator tool won't give reduce/reduce errors ... warnings maybe, but not errors ... and will create a parser that just forks on ambiguous input, wanders off into the hills and gets lost.

Most useful program/control languages do not need PEG (or even GLR). If your language really is that complicated, you probably should rethink it.


The trie would be part of the tokenizer, so it would be compatible with any parse strategy.


Reply to
George Neuner

I hadn't considered the idea of expressing "new" commands in terms of "old"/common commands.

What I am looking for is the ability to define a set of commands that will "always" be used/useful. Then, allow others to add to that set of commands to address specific needs that are necessary/unique to their environment. (i.e., the environment in which these common commands are deployed)

Contrived example:


Deploy this in a context where stdin/stdout might want to also support a serial port (" = TTY") would probably necessitate adding:


In the presence of multiple TTY's, it might then require:


Deploy it where stdin/stdout are a VDT and, perhaps:


Deploy it where stdin/stdout are a telnet connection, and perhaps:


I.e., the "arguments" can't be easily foreseen from the perspective of the first/common commands.

Note that I've presented these in a somewhat "harmonious"/consistent manner. Another implementer could have adopted a syntax like:



display WHITE on field of BLUE



I.e., I can't assume he'll consider the form to be appropriate to his/her future needs.

Hmmm... I had considered the opposite approach: let me verify the input is/isn't a "common command" before exposing the input to the "other" parser. The thought being: common commands are then implicitly invariant (you can't change the syntax of them) AND they "always work" (even if the additional parser has bugs in its implementation).

Reply to
Don Y

You could try to bring every command into the ... form with some clever syntax. I'm tempted to say "Lisp", but that probably doesn't enjoy universal love :-) However, you can easily bring if (foo) { bar(); baz(); } else { fred(); } into " ..." form by making "{}" string delimiters. That makes 'if' a command with four parameters: an expression '(foo)', a string holding the 'if' branch, the unquoted word 'else', and a string holding the 'else' branch.

I recently built a command shell like that, and I believe Tcl does it the same way. So all you would need to give your users would be the ability to register a command verb, and a way to evaluate strings as command sequences.
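The brace-as-string-delimiter trick boils down to a word splitter that treats a (possibly nested) {...} group as one word, with the outer braces stripped -- roughly what Tcl's reader does. A rough C sketch, with invented names:

```c
#include <string.h>

/* Split a command line into words.  A {...} group (braces may nest)
 * becomes a single word minus its outer braces.  NUL-terminated
 * words are copied into 'buf', pointers stored in 'argv'; returns
 * the word count.  Caller guarantees buf is at least strlen(s)+1. */
static int split_words(const char *s, char *buf, char **argv, int max) {
    int argc = 0;
    char *out = buf;
    while (*s) {
        while (*s == ' ' || *s == '\t') s++;     /* skip blanks */
        if (!*s || argc == max) break;
        argv[argc++] = out;
        if (*s == '{') {                         /* braced group */
            int depth = 1;
            s++;
            while (*s && depth > 0) {
                if (*s == '{') depth++;
                else if (*s == '}') {
                    if (--depth == 0) { s++; break; }
                }
                *out++ = *s++;
            }
        } else {                                 /* ordinary word */
            while (*s && *s != ' ' && *s != '\t')
                *out++ = *s++;
        }
        *out++ = '\0';
    }
    return argc;
}
```

Evaluation is then: word 0 names the verb, the rest are its parameters, and a handler like 'if' re-feeds its string parameters back through the evaluator.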


Reply to
Stefan Reuther

Who's going to be adding stuff, why, and why won't you have control over the process?

I have an in-house parser that I use for debugging and service on a number of projects. It's OO in C++, but it basically supports a syntax of

You create an object and give it the keyword, which automagically registers it with the first-layer parser. When someone types something in, the first-layer parser looks for the keyword and, if it's found, hands it to the object for further parsing.

It's crude but effective. I keep telling myself that I'll make more library-ish functionality for parsing the bit, but so far that's all ad-hoc.
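That keyword-registration scheme translates readily to plain C with a static registry; a sketch (names invented, and without the automagic constructor-time registration that C++ objects give you):

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CMDS 32

/* A handler receives the rest of the line (after the keyword)
 * and does its own further parsing. */
typedef int (*handler_fn)(const char *rest);

static struct { const char *keyword; handler_fn fn; } registry[MAX_CMDS];
static size_t ncmds;

static void register_cmd(const char *keyword, handler_fn fn) {
    if (ncmds < MAX_CMDS) {
        registry[ncmds].keyword = keyword;
        registry[ncmds++].fn = fn;
    }
}

/* First layer: match only the keyword, pass the remainder down. */
static int first_layer(const char *line) {
    size_t klen = strcspn(line, " \t");
    for (size_t i = 0; i < ncmds; i++) {
        if (strlen(registry[i].keyword) == klen &&
            strncmp(line, registry[i].keyword, klen) == 0) {
            const char *rest = line + klen;
            while (*rest == ' ' || *rest == '\t') rest++;
            return registry[i].fn(rest);
        }
    }
    return -1;   /* unknown keyword */
}

/* Example second-layer handler (invented): parses "baud <rate>". */
static int set_baud(const char *rest) { return atoi(rest); }
```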

Tim Wescott 
Wescott Design Services 
Reply to
Tim Wescott

Note the (contrived) example I posted (elsewhere this thread) attempts to do that -- by decomposing what might otherwise be "complex" (multivalued) commands into a series of trivial (single parameter) commands.

E.g., easier to accept an 8 digit HEX value for an IP address (and another for a mask) than to accept four dotted decimal values (for each).

But, I can't assume others will take a similar approach. Nor can I imagine the variety of things that they might need/want to address via this interface.


I suspect I will end up with a set of routines that parse specific types of tokens (number, hex number, word, etc.) that the DEVELOPER could call upon in fashioning an ad hoc parser. Perhaps I'm too jaded but I bet that's how additions will end up being implemented -- past observations suggest this.

I think it's probably easier for folks to think in terms of: "OK, now I'll expect an integer in the range of XXX to YYY". Rather than let them naively do something like read the characters into a fixed size buffer (buffer overrun potential!), I can handle SAFELY scanning the next token AS AN INTEGER and report a value that they can just test (plus an error code if the token isn't an integer).

That, at least, gives something that can be leveraged safely.
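Such a routine might look like the following sketch: strtol does the scanning, so no fixed-size buffer exists to overrun, and the caller gets a simple yes/no plus the value (names and ranges are illustrative):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Scan the next token of *s as a decimal integer confined to
 * [lo, hi].  On success, store the value, advance *s past the
 * token, and return true; otherwise leave *s alone and return
 * false (covers "not an integer", overflow, and out-of-range). */
static bool scan_int(const char **s, long lo, long hi, long *out) {
    char *end;
    errno = 0;
    long v = strtol(*s, &end, 10);
    if (end == *s || errno == ERANGE)   /* no digits, or overflow */
        return false;
    if (v < lo || v > hi)               /* syntactically OK, bad range */
        return false;
    *s = end;
    *out = v;
    return true;
}
```

The developer's "OK, now I'll expect an integer in the range of XXX to YYY" then becomes a one-line call with a testable result.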

Reply to
Don Y

Other developers. To address the needs of their environments. Because it's not my codebase.

I actually think a viable solution might be as above. I.e., write something that parses for *just* one particular command. If no match, try invoking the thing that parses for the *next* particular command. Lather, rinse, repeat until a match is found (successful parse) *or* no more possibilities (FAIL).

Again, this gets used with very low frequency so it doesn't have to be super efficient. With subroutines/functions that the developer can invoke (because I had to write them to parse the commands that *I* implemented), it could be just a few lines of code extracting parameters from syntactic sugar, etc.

I *don't* want to intimidate a developer into having to embrace some more formal framework with which he may not be comfortable. That way lies bugs.

Reply to
Don Y

To be honest I'm not clear on what that last part means. Adding new "commands" is relatively straightforward but new syntax patterns are surely impossible in the general case without revising the grammar directly.

To clarify: anything you could express inside a C function is probably straightforward, but a new language construct, such as a new form of loop or a switch statement that accepts non-constant case labels, is going to be tricky. Hell, in the truly general case even basic stuff such as determining where a statement begins and ends is impossible if arbitrary modifications are permissible.

I did write a parser for the former simpler case a year or two ago: it's easy enough approached from the right direction at the outset. The key isn't so much in the parser but the lexer. In the parser my (slightly simplified) YYSTYPE was something along the lines of:

    %union {
        struct { char *text; } string;
        struct { char *text; int64_t value; } number;
        struct { char *text; int (*function)(YYSTYPE *params); } command;
    }

(If you're wondering why every option has a text argument it's because any token could be converted to the original input if a string is expected in context.)

The lexer then proceeded more or less as normal, but with only a single pattern matching a generic potential command name, i.e. a sequence of alphabetic characters. The candidate command name was then given to a lookup function that did a binary search of a static array, but obviously how the lookup function works depends on the context.

If there was no match, the lexer returned the token as a string; if there was a match, the function pointer was obtained from that array, along with a token code typing its parameters, e.g. a command with two numeric parameters, or a command with a numeric and two string parameters. That approach means the "plumbing" to add a new command is simply an extra line in the array definition.
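The array-plus-bsearch lookup might look roughly like this (a sketch with made-up command names and token codes, not Andrew's actual code):

```c
#include <stdlib.h>
#include <string.h>

/* Token codes the lexer would hand to the parser; invented names.
 * CMD_NN: command taking two numbers, CMD_S: command taking a string. */
enum { CMD_NN = 258, CMD_S = 259 };

struct cmd {
    const char *name;
    int         token;              /* what yylex() returns */
    int       (*function)(void);    /* stand-in for the real handler */
};

static int do_move(void) { return 1; }
static int do_say(void)  { return 2; }

/* Must stay sorted by name for bsearch(); adding a command is
 * just one more line here. */
static const struct cmd commands[] = {
    { "move", CMD_NN, do_move },
    { "say",  CMD_S,  do_say  },
};

static int cmd_cmp(const void *key, const void *elem) {
    return strcmp((const char *)key,
                  ((const struct cmd *)elem)->name);
}

/* NULL means: not a command, treat the word as a plain string. */
static const struct cmd *lookup(const char *name) {
    return bsearch(name, commands,
                   sizeof commands / sizeof commands[0],
                   sizeof commands[0], cmd_cmp);
}
```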

Things get a little more complex with optional parameters but it remains practical provided the number of possible combinations remains smallish. The parameter code is what yylex() returns as its result so the parser knows what should follow. In my case yylex() was handwritten, but that was so it could directly access my app's internal representation, there's no reason the same approach couldn't be done in lex.

If you wanted to completely generalise it I suppose you could have a syntax rule to convert any possible parameter token to a single "parameter" term and have a single rule for "command (array of parameters)" but I'd try and avoid that if possible - it moves checking from the parser to each individual command. Still, it's possibly something to have in reserve for oddball commands that don't fit a general pattern.

Andrew Smallshaw
Reply to
Andrew Smallshaw

You and I have talked about this before: it doesn't really matter how simple the interface - the way to think about it is as if you are creating a scripting language in which "operators" are the control functions exposed by the device.

The problem is, that also means the complete set of *functionality* can't be foreseen ... else you would have provided for it already.

Your "core" can be operated with it's provided command set, but you can't easily provide for new commands targeting an environment you know nothing about.

You may want to look up our (ancient) exchange about runtime parser generators. You don't need that here, but the discussion about providing for JIT generated code to link to your device API is relevant.

You don't want to "front end" the extensions ... if one of your commands is redefined, it won't work if you grab it first.


Reply to
George Neuner

If all else fails, strstr() keywords, then have a corresponding parser for each keyword.

A better approach is a container (struct) of keywords or arguments (if it's not a keyword, it's an argument), an integer for the index of the keyword in a string table, and an index for the position of the token within the command.

Extensibility should not be difficult.

Les Cargill
Reply to
Les Cargill

Hi Don,

Have you looked at Forth? What you are describing sounds like the problem that Forth solved back in 1968. It is the sort of thing that Forth programmers have been doing for nearly 50 years and it works very successfully for us.


Paul E. Bennett IEng MIET..... 
Reply to
Paul E Bennett

Once upon a time, I realized that if one uses distinct left and right operator precedence values, one can include brackets as operators (high outside precedence and matching low inside precedence). I never figured out how to parse an operator that had simultaneous infix, prefix and suffix versions. The appeal is that the parser is table driven and that the table is simple and easy to amend. I am told that this approach has limitations and has been tried.
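What's being described sounds like a Pratt (top-down operator precedence) parser: every operator carries separate left and right binding powers, and the table is trivial to amend. A compact C sketch over single-character tokens; here the low-inside-precedence behavior of brackets is coded directly rather than driven purely from the table:

```c
#include <ctype.h>

/* Operator table: distinct left and right binding powers.
 * rbp = lbp + 1 gives left associativity. */
struct op { char ch; int lbp, rbp; };

static const struct op ops[] = {
    { '+', 10, 11 }, { '-', 10, 11 },
    { '*', 20, 21 }, { '/', 20, 21 },
};

static const struct op *find_op(char c) {
    for (unsigned i = 0; i < sizeof ops / sizeof ops[0]; i++)
        if (ops[i].ch == c) return &ops[i];
    return 0;
}

/* Parse and evaluate an expression; only continue while the next
 * operator binds at least as tightly as min_bp. */
static long expr(const char **s, int min_bp) {
    long lhs;
    if (**s == '(') {                  /* bracket: restart at bp 0, */
        (*s)++;                        /* i.e. low "inside" precedence */
        lhs = expr(s, 0);
        if (**s == ')') (*s)++;        /* consume the matching bracket */
    } else {
        lhs = 0;                       /* decimal literal */
        while (isdigit((unsigned char)**s))
            lhs = lhs * 10 + (*(*s)++ - '0');
    }
    for (;;) {
        const struct op *op = find_op(**s);
        if (!op || op->lbp < min_bp)
            return lhs;
        char c = *(*s)++;
        long rhs = expr(s, op->rbp);
        switch (c) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}
```

Extending the language is adding a line to the table; the simultaneous infix/prefix/suffix case mentioned above is indeed the awkward one, since the parser must then decide which role applies from context.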

Reply to

Exactly. Because I can't anticipate the functionality that will be needed in the future. E.g., the contrived example.

Yes. And, I can't "guess" to the types of data/actions that may need to be "encoded" in that command set.

Nor can I ensure the next guy will have the same approach to the problem -- half the commands may "feel like mine" while the other half "feel like his/hers". And, neither group "feels consistent" with the other.

Like me using infix operators and he using RPN.

Can't. The (forced) "computer upgrade" cost me my mail archive (I don't leave mail stored on server :< )

That was exactly the point: that the common commands are *common* and always behave exactly the same. (as well as "always working" regardless of how much the other guy mucks with the interface).

E.g., if "run" was a common command and he wanted to "overload" it to add/alter functionality, he'd have to create a new command ("walk_very_very_quickly") to bind to the actions he desires. Perhaps he wants to turn on a big red light prior to activating the mechanism ("run"-ing). If I let him redefine "run" to do this, I risk him forgetting to verify safety mechanisms are in place (which other users would ASSUME when invoking "run").

It's a tough call -- assume competence in EVERYTHING? Or, just assume competence in their knowledge of their "extensions"??

Reply to
Don Y

I am leery of taking things in a different direction from what is (and has been) expected in the codebase. IIRC, someone did something like this as part of an early FreeBSD release (?). And, I think SPARCs have a Forth-based "Open Firmware". So, it's not a "revolutionary" concept that I'd have to expend effort defending...

Admittedly, it lets you (the user) do lots of things that you wouldn't be able to even consider in, e.g., a conventional BIOS. (And, would be far more "future safe" from my point of view).

I'll have to think real hard on that. If the "typical" sorts of things "look mildly familiar", then it might be possible to slip it in under the radar (without religious issues biasing the decision/acceptance). OTOH, if things start to look too funky, then it just gives people an excuse to dislike it. :-/



Reply to
Don Y

The ideal would be to write a usage() for each command -- to be displayed when "help" is invoked. Then, algorithmically extract the parse information from the example syntax present in that usage() presentation.

But, I think that's too lofty a goal. It would, by necessity, make the usage() very verbose -- to qualify all the restrictions on the tokens in the example syntax (unless the code was smart enough to recognize common constructs and document them elsewhere).

"Small" tends to preclude much of that added smarts.

Reply to
Don Y
