lex/yacc efficiency tradeoffs

Not really ... pull vs push parsing is more like the difference between function call-return and CPS: the control path is inverted.

In a push design the lexer is fed characters one at a time and tries incrementally to match its partial input against the recognizer set. When a valid token is identified, that token is fed to the parser which tries incrementally to match partial token sequences against its rule set. When a rule is matched, the parser executes the associated user code as is usual.

The only additional complexity vs the normal pull design is that the lexer and parser have to remember the current state of their FSMs between activations. Because the bison/flex FSMs are table driven, in practice that amounts to just a few integer values.
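The retained-state idea can be sketched in Python (a toy, not flex output; the token names WORD/NUM and the `make_push_lexer` helper are invented for illustration). The closure keeps only the current token kind and partial text between activations, and an out-of-place character is rejected the instant it arrives:

```python
# Toy push-style lexer: fed one character per call, it retains only a small
# amount of state between activations -- mirroring how a table-driven flex
# scanner needs just a few integers to resume where it left off.

def make_push_lexer(emit):
    state = {"kind": None, "buf": []}   # the retained "FSM" state

    def feed(ch):
        kind, buf = state["kind"], state["buf"]
        if ch.isdigit():
            if kind == "WORD":                      # word ended, digit begins
                emit(("WORD", "".join(buf)))
                buf.clear()
            state["kind"] = "NUM"
            buf.append(ch)
        elif ch.isalpha():
            if kind == "NUM":                       # number ended, word begins
                emit(("NUM", "".join(buf)))
                buf.clear()
            state["kind"] = "WORD"
            buf.append(ch)
        elif ch in " \n":
            if kind:                                # delimiter completes a token
                emit((kind, "".join(buf)))
                buf.clear()
                state["kind"] = None
        else:
            # No recognizer matches: signal the error immediately.
            raise SyntaxError(f"unexpected character {ch!r}")
    return feed
```

Usage: create the lexer with a callback (here, appending to a list plays the role of pushing tokens into the parser) and feed it characters one at a time.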

WRT memory use, a push lexer doesn't need any character buffering at all unless it must pass a matched string on to the parser. Input stream buffering (if any) is normally done by the surrounding program code. Character strings passed from lexer to parser need to be heap allocated, but the same is generally true in pull designs. Overall the push design can use less memory.

As far as catching errors early, the lexer is essentially fed one character at a time ... if it can tell that the character is out of place for a particular recognizer context, it can signal the error immediately (either directly or by passing an error token to the parser).

Yacc and lex are severely outdated; they are kept around only for backward compatibility. Bison and flex are now standard issue on nearly every Unix/Linux system. Besides which, anyone likely to understand an LALR grammar well enough to maintain/modify the code likely already knows about them.

Unless the language[*] is simple or the target platform extremely limited, a hand-crafted parser is almost never the right way to go. Good CCG tools like bison, Yacc++, ANTLR, LLgen, etc. have sound theory behind them and their results are predictable and reproducible. For non-trivial grammars, generated parsers are likely to be both smaller and faster than most people could hand craft.

[*] A comm protocol is a language just as a Turing-complete programming language is; the only difference is complexity.

Lexers are a different story. Lexers are relatively easy and when the set of recognizers is limited, table driven regex can be improved upon by hand coding. Quite a few real-world systems use a generated parser with a hand coded lexer.

Most use of the CPU stack is from user-added code. The FSMs in both tools use very little stack: the generated code is basically a loop containing a big switch statement - there is no recursion. Yacc's PDA stack is either a static array or a heap allocation (selectable), and each stack element is quite small: a couple of integer indexes and a "token" type (by default an integer, but it can be a user-defined union or structure).

Push parsers/lexers created by bison/flex are just big switch statements - they don't need the control loop.

I'm not sure what you mean by that - yacc/bison and (f)lex don't have any "configuration" problems that aren't part of any multiple input file project. However, there are better tools that will generate parser and lexer from a single grammar file.

I still think you should look into using a push parser.

George

Reply to
George Neuner

Yes, this is top down vs. bottom up (in Gries-speak).

A bottom up parser tries to "promote" (Gries would say "reduce") the terminals to nonterminals with the eventual (ultimate) goal of reducing them to the "distinguished symbol" (e.g., "S" or "G" or whatever you use to represent a "valid 'sentence'" in that grammar).

A top down parser starts at the distinguished symbol and tries to come up with a set of non terminals that map onto the input stream of terminals.

In the bottom up case, if a character ("terminal") can be promoted (reduced) to *any* nonterminal, then the parser has to accept it IN CASE that nonterminal might eventually be recognized and promoted (reduced) to yet another nonterminal and, ultimately, to the distinguished symbol (WIN). Only when the parser can determine that the character *doesn't* allow a promotion to the "distinguished symbol" can the input stream be considered "in error" (i.e., can't get a valid sentence from this input).

In the top down case, as long as the input (stream) continues to match *some* possible set of terminals (and nonterminals) that can be derived *FROM* the "distinguished symbol", parsing is "on track". I.e., "so far, this input stream *could* potentially form a valid sentence". OTOH, as soon as a terminal is encountered that *doesn't* fit a derivation from the distinguished symbol, *everything* up to this point must be considered bogus.

Assume S ::= VERB NOUN VALUE? where VERB and NOUN are specific "words" and VALUE is a "number". (S is the "distinguished symbol" for this grammar)

In practical terms, if I see a set of digits, I keep gobbling them up as they can represent a "number" (VALUE in my example). Once I see the end of that sequence of digits, I can promote this to a VALUE. But, then I realize that "VALUE isn't valid in this 'context'" (surroundings). So, my error is detected late instead of early.

OTOH, if you proceed top down, you expand the (single) rule for the distinguished symbol (S) to get VERB NOUN VALUE? You then expand VERB and find that the *first* character in your input stream (a digit!) doesn't satisfy the (single!) rule for VERB (a string of letters, for example). So, you immediately know the "digit" represents a syntax error without waiting around to accumulate more "digits".
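The contrast above can be sketched as a toy top-down parser for S ::= VERB NOUN VALUE? (Python; `parse_sentence` and its word-level token checks are illustrative assumptions, not any particular tool's output). The very first out-of-place item is rejected before any digits are accumulated:

```python
# Toy top-down (predictive) parse of S ::= VERB NOUN VALUE?
# Expanding from the distinguished symbol means the first input item is
# checked against VERB right away -- a leading digit fails immediately.

def parse_sentence(text):
    words = text.split()
    pos = 0

    def expect(pred, what):
        nonlocal pos
        if pos >= len(words) or not pred(words[pos]):
            got = words[pos] if pos < len(words) else "<end>"
            raise SyntaxError(f"expected {what}, got {got!r}")
        pos += 1
        return words[pos - 1]

    verb = expect(str.isalpha, "VERB")    # a digit here errors out at once
    noun = expect(str.isalpha, "NOUN")
    value = None
    if pos < len(words):                  # VALUE? is optional
        value = int(expect(str.isdigit, "VALUE"))
    return verb, noun, value
```

A bottom-up parser, by contrast, would happily reduce the leading digits to a VALUE before discovering that VALUE cannot start a sentence.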

Exactly. Communication protocols tend to be very narrowly defined. E.g., cc(1) will gladly interpret: "123.000000000000000000 + -456.000000000000" whereas conveying that same *information* could be done with "123 + -456" (assuming the protocol implicitly promotes to floats)

I'm hoping to use *one* set of tools to implement protocols on a range of devices from resource rich "hosts" to resource *starved* "motes". The appeal is significant since it lets protocols et al. be defined and implemented before target hardware is selected. However, if the resulting code ends up placing a burden that effectively *requires* a specific class of processor just to parse messages, then the cost will exceed the benefit.

Look at your comments again with "I have a little 8b micro" in mind. Ask yourself how much of that processor's resources you want to set aside a priori for lex/yacc's needs. Then, ask yourself what you could do "by hand".

No, I meant configuring the *runtime* that they generate. E.g., so that UNCONSTRAINED input doesn't cause them to crash; so they can handle the intended syntax accurately yet without bloat, etc.

E.g., if I *know* all "VALUE"s are 4 "DIGIT"s and DIGITs are never used elsewhere, I (writing a hand-crafted parser) can set aside a single "short" and parse the VALUE directly into it -- without ever having to track the individual DIGITs that comprise those VALUEs (in case they might be used in other nonterminals).
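A minimal sketch of that hand-crafted trick (Python standing in for the C one would write on a small micro; `read_value` is a hypothetical helper): each DIGIT is folded into a single accumulator as it arrives, so the individual characters are never buffered or tracked:

```python
# Hand-crafted VALUE parser: VALUEs are known to be exactly four DIGITs,
# so each digit is folded directly into one accumulator ("a single short"
# in the C version) instead of being stored for later reduction.

def read_value(chars):
    value = 0
    for i in range(4):
        ch = next(chars, None)
        if ch is None or not ch.isdigit():
            raise SyntaxError(f"digit expected at position {i}")
        value = value * 10 + (ord(ch) - ord("0"))   # max 9999 fits a short
    return value
```

A generic table-driven lexer would instead match and buffer the digit string, heap-allocate it, and hand it to the parser to reduce - all avoidable here because the grammar is known in advance.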

As I said in previous post, I'll put together an example that points out each of these items and puts real numbers on them so you can better see the "costs" involved. For even trivial grammars, they are significant! :<

Reply to
D Yuniskis

This proves that while you understand some of the theory, you clearly do not have a firm grasp on the technology.

What you are describing is the difference between "sequence" matching (also called "tuple" matching) and "predictive" matching. Either form of matching can be applied to either parsing or to lexing (tokenizing) ... and you are improperly conflating these levels.

All available LR tools - yacc, bison, etc. - are bottom-up parallel sequence match parsers that use a finite state machine implementation (almost all use a table-driven PDA) ... the parser accumulates tokens (context) as long as the set of partial rule matches is non-empty and reduces (executes) a rule when it can be exactly matched to the tail of the sequence.

All available LL tools - ANTLR, LLgen, pccts, etc. - are top-down sequential predictive parsers that use a recursive descent coded implementation ... the parser chooses a rule to pursue based on the current "look-ahead" token(s) (the next to be input). Predictive parsing has trouble when multiple rules have common prefixes - unlike sequence matching, predictive matching has no viable FSM implementation for parallel matching, so common prefixes have to be handled by a) refactoring the grammar, b) backtracking (trying each alternative), c) increasing look-ahead (syntactic predication), d) forking the parser, or e) all of the above.
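Option (a), refactoring the grammar, can be illustrated with a small left-factoring sketch (the declaration grammar and `parse_init` helper are invented for illustration):

```python
# Left-factoring a common prefix so one look-ahead token suffices.
#
#   before:  decl ::= "int" NAME ";"
#          |          "int" NAME "=" NUM ";"     <-- common prefix "int" NAME
#
#   after:   decl ::= "int" NAME init ";"
#            init ::= "=" NUM | <empty>
#
# With the prefix factored out, the predictive parser only has to decide
# the (now disjoint) alternatives of `init`, which one token resolves.

def parse_init(tokens, pos):
    # One look-ahead token now picks the alternative unambiguously.
    if pos < len(tokens) and tokens[pos] == "=":
        return ("init", tokens[pos + 1], pos + 2)
    return ("no-init", None, pos)
```

Before factoring, a one-token predictive parser seeing "int" could not tell which `decl` alternative it was in; a sequence-matching (LR) parser never has this problem because it tracks both partial matches in parallel.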

"LR" and "LL" refer to Chomsky language groupings and to different ways of decomposing the input. It is logically incorrect to lump "LL" with predictive parsing and "LR" with sequence parsing ... the reasons for these associations are historical and are based on the most efficient ways known to implement them on serial computers. There is no technical reason why a LR parser can't be predictive and recursive decent coded or why a LL parser can't be parallel and FSM coded ... they aren't because those implementations would be (much) less efficient than the ones now used.

There are both top-down and bottom-up lexing (tokenizing) tools. (f)lex is a bottom-up sequence matching FSM implementation based on regular expressions. ANTLR is a top-down predictive recursive descent coded implementation based on EBNF plus regular expressions. There are a number of other lexer tools (mostly top-down) that use EBNF for specifying patterns.

================================

What I was talking about with "pull" vs "push" is control flow. The conventional pull case is diagramed this way:

   application --calls--> parser --calls--> lexer --reads--> input

with tokens and characters handed back up the call chain. In the push case the arrows reverse: the application reads the input and pushes characters into the lexer, which pushes tokens into the parser.

>... "VALUE isn't valid in this 'context'" (surroundings). So, my error is detected late instead of early.

Sigh. For the, I think 3rd, time, this has NOTHING to do with parsing strategy ... it is a tokenization, i.e. lexer, issue.

All that is required to catch syntax errors at the character level is for the parser to provide the needed context information so that the tokenizer can make informed choices.
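A toy sketch of that arrangement (Python; `ContextLexer` and its `expects` set are invented for illustration, not any real yacc/flex interface): the parser publishes what it can accept next, and the tokenizer rejects anything else on the spot:

```python
# Character-level error detection via parser-supplied context: the parser
# updates `expects` as its rules progress, and the tokenizer checks each
# incoming character against that context before accepting it.

class ContextLexer:
    def __init__(self):
        self.expects = {"LETTER"}   # parser updates this between tokens

    def feed(self, ch):
        kind = "DIGIT" if ch.isdigit() else "LETTER"
        if kind not in self.expects:
            # Out-of-place character: signal the syntax error immediately.
            raise SyntaxError(f"{ch!r} not valid here (wanted {self.expects})")
        return kind
```

In the VERB NOUN VALUE example, the parser would leave `expects` at LETTER while matching VERB, so a leading digit is rejected on its very first character - no parsing-strategy change required.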

If you really want/need to do this then (f)lex probably is the wrong tool to use and you may want to use a predictive tokenizer. But the choice of lexing strategy has absolutely zero bearing on the choice of parsing strategy ... and not a whole lot of bearing on the choice of parser tool either given that most expect a separate lexer and will work with anything that obeys their interface convention. (There are a few tools which expect the parser will directly control the input stream, but none of them are terribly popular).

George

Reply to
George Neuner
