## What I learned, what I wonder
### Stab 5 (simple LR, LR(1), then LALR(1))
Well. I learned enough to implement this, although there is still much I
don't understand.
I learned a bit about what kind of phenomenon can render a grammar
outside XLL(1) (that is, LL(1) as extended by automated left-factoring
and left-recursion elimination); see `testFirstFirstConflict` in
`test.py` for a contrived example, and `testLeftHandSideExpression` for
a realistic one.
I learned that the shift-reduce operator precedence parser I wrote for
SpiderMonkey is even less like a typical LR parser than I imagined.
I was stunned to find that the SLR parser I wrote first, including the
table generator, was *less* code than the predictive LL parser of stab
4. However, full LR(1) took rather a lot of code.
I learned that I will apparently hand-code the computation of transitive
closures of sets under relations ten times before even considering
writing a general algorithm. The patterns I have written over and over
are:
1. `while not done:` visit every element already in the set, iterating
to a fixed point, which is this ludicrous O(*n*<sup>2</sup>) in the
number of pairs in the relation;
2. depth-first graph walking with cycle detection, which can overflow
the stack.
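A worklist formulation would avoid both problems: each element gets
expanded exactly once, and there's no recursion to overflow. A minimal
sketch, with a made-up helper name (nothing like this exists in the
code yet):

```python
# Hypothetical helper: the closure of `seed` under `step`, where step(x)
# yields the elements directly related to x.
def close_over(seed, step):
    closure = set(seed)
    todo = list(closure)            # worklist of elements not yet expanded
    while todo:
        x = todo.pop()
        for y in step(x):
            if y not in closure:    # only enqueue genuinely new elements
                closure.add(y)
                todo.append(y)
    return closure

# Example: reachability in a tiny relation.
edges = {"a": ["b"], "b": ["c"], "c": []}
assert close_over({"a"}, lambda x: edges.get(x, ())) == {"a", "b", "c"}
```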
I learned three ways to hack features into an LR parser generator (cf. how
easy it is to hack stuff into a recursive descent parser). The tricks I
know are:
1. Add custom items. To add lookahead assertions, I just added a
lookahead element to the LRItem tuple. The trick then is to make
sure you are normalizing states that are actually identical, to
avoid combinatorial explosion—and eventually, I expect, table
compression. (There's a rough sketch of this after the list.)
2. Add custom actions. I think I can support automatic semicolon
insertion by replacing the usual error action of some states with a
special ASI action.
3. Desugaring. The ES spec describes optional elements and parameterized
nonterminals this way, and for now at least, that's how we actually
implement them.
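Here's roughly what I mean by the first trick. The field names are
illustrative only, not necessarily what this code actually uses:

```python
from collections import namedtuple

# An LR item that also records a lookahead restriction (illustrative shape).
LRItem = namedtuple("LRItem", ["prod_index", "offset", "lookahead"])

def make_state(items):
    # Normalize aggressively: a state is identified by its set of items, so
    # two states reached along different paths but containing the same items
    # collapse into one instead of exploding combinatorially.
    return frozenset(items)
```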
There's a lot still to learn here.
* OMG, what does it all mean? I'm getting more comfortable with the
control flow ("calls" and "returns") of this system, but I wouldn't
say I understand it!
* Why is lookahead, past the end of the current half-parsed
production, part of an LR item? What other kinds of item
embellishment could be done instead?
* In what sense is an LR parser a DFA? I implemented it, but there's
more to it that I haven't grokked yet.
* Is there just one DFA or many? What exactly is the "derived" grammar
that the DFA parses? How on earth does it magically turn out to be
regular? (This depends on it not extending past the first handle,
but I still don't quite see.)
* If I faithfully implement the algorithms in the book, will it be
less of a dumpster fire? Smaller, more factored?
* How can I tell if a transformation on grammars preserves the
property of being LR(k)? Factoring out a nonterminal, for example,
may not preserve LR(k)ness. Inlining probably always does.
* Is there some variant of this that treats nonterminals more like
terminals? It's easy to imagine computing start sets and follow sets
that contain both kinds of symbols. Does that buy us anything?
Things I noticed:
* I think Yacc allows bits of code in the middle of productions:
  ```
  nt1: T1 T2 nt2 { code1(); } T3 nt3 T4 { code2(); }
  ```
  That could be implemented by introducing a synthetic production
  that contains everything up to the first code block:
  ```
  nt1_aux: T1 T2 nt2 { code1(); }
  nt1: nt1_aux T3 nt3 T4 { code2(); }
  ```
There is a principle that says code should happen only at the end of
a production, because LR states are superpositions of items: we
don't know which production we are really parsing until we reduce,
so we don't know which code to execute.
* Each state is reachable from an initial state by a finite sequence
of "pushes", each of which pushes either a terminal (a shift action)
or a nonterminal (a summary of a bunch of parsing actions, ending
with a reduce).
States can sometimes be reached multiple ways (it's a state
transition graph). But regardless of which path you take, the symbols
pushed by the last few steps always match the symbols appearing to
the left of the point in each of the state's LR items. (This implies
that those items have to agree on what has happened. Might make a
nice assertion.)
### Stab 4 (nonrecursive table-driven predictive LL parser)
I learned that testing that a Python program can do something deeply
recursive is kind of nontrivial. :-\
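For the record, the only workaround I know is to raise the interpreter's
recursion limit and run the test in a thread with a bigger stack. A
sketch, not necessarily what the tests here actually do:

```python
import sys
import threading

def run_deeply_recursive(test, depth=100_000):
    # CPython's default recursion limit (~1000) and the default thread stack
    # are both too small for a deeply recursive parser, so raise the limit
    # and give the test its own thread with a larger stack.
    sys.setrecursionlimit(depth)
    threading.stack_size(64 * 1024 * 1024)  # must be set before creating the thread
    t = threading.Thread(target=test)
    t.start()
    t.join()
```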
I learned that the predictive parser still takes two stacks (one
representing the future and one representing the past). It's not magic!
This makes me want to hop back to stab 3, optimize away the operand
stack, and see what kind of code I can get.
It seems like recursive descent would be faster, but the table-driven
parser could be made to support incremental parsing (the state of the
algorithm is "just data", a pair of stacks, neither of which is the
parser program's native call stack).
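To spell out the two-stack idea, here is a minimal sketch of the driver
loop. The table shape and symbols are made up for illustration; the real
thing would carry productions and build a parse tree:

```python
END = "$"

def predictive_parse(table, start, tokens):
    # table[(nonterminal, lookahead)] -> production body (a list of symbols).
    # Terminals and nonterminals are both plain strings in this toy version.
    future = [END, start]       # symbols we still expect to match, top at the end
    past = []                   # output: terminals matched and productions chosen
    tokens = list(tokens) + [END]
    pos = 0
    while future:
        top = future.pop()
        look = tokens[pos]
        if top == look:                      # terminal (or END): match and advance
            past.append(top)
            pos += 1
        elif (top, look) in table:           # nonterminal: predict and expand
            body = table[(top, look)]
            past.append((top, body))
            future.extend(reversed(body))
        else:
            raise SyntaxError(f"expected {top!r}, got {look!r}")
    return past

# Tiny example: s ::= "x" s | [empty]
table = {("s", "x"): ["x", "s"], ("s", END): []}
predictive_parse(table, "s", ["x", "x"])
```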
### Stab 3 (recursive descent with principled left-recursion-elimination and left-factoring)
I learned how to eliminate left recursion in a grammar (Algorithm 4.1
from the book). I learned how to check that a grammar is LL(1) using
the start and follow sets, although I didn't really learn what LL(1)
means in any depth. (I'm just using it as a means to prove that the
grammar is unambiguous.)
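Concretely, the algorithm turns immediate left recursion like this (a
toy rule, not from the JS grammar):
```
expr ::= expr "+" term
expr ::= term
```
into a right-recursive pair:
```
expr ::= term expr_tail
expr_tail ::= "+" term expr_tail
expr_tail ::= [empty]
```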
I learned from the book how to do a table-driven "nonrecursive
predictive parser". Something to try later.
I came up with the "reduction symbol" thing. It seems to work as
expected! This allows me to transform the grammar, but still generate
parse trees reflecting the source grammar. However, the resulting code
is inefficient. Further optimization would improve it, but the
predictive parser will fare better even without optimization.
I wonder what differences there are between LL(1) and LR(1) grammars.
(The book repeatedly says they are different, but the distinctions it
draws suggest differences like: left-recursive grammars can be LR but
never LL. That particular difference doesn't matter much to me, because
there's an algorithm for eliminating left recursion.)
### Stab 2 (recursive descent with ad hoc immediate-left-recursion-elimination)
I learned it's easy for code to race ahead of understanding.
I learned that a little feature can mean a lot of complexity.
I learned that it's probably hard to support indirect left-recursion using this approach.
We're able to twist left-recursion into a `while` loop because what we're doing is local to a single nonterminal's productions,
and they're all parsed by a single function.
Making this work across function boundaries would be annoying,
even ignoring the possibility that a nonterminal can be involved in multiple left-call cycles.
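Concretely, the single-function trick is that a rule like
`expr ::= expr "+" term | term` becomes a loop inside the one function
that parses `expr`. A sketch, with made-up helper names:

```python
def parse_expr(self):
    # expr ::= expr "+" term | term
    # Left recursion becomes iteration: parse the non-left-recursive
    # alternative first, then keep extending the result to the left.
    result = self.parse_term()
    while self.peek() == "+":
        self.take("+")
        result = ("add", result, self.parse_term())
    return result
```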
I wonder if the JS spec uses any indirect left-recursion.
I wonder if there's a nice formalization of a "grammar with actions" that abstracts away "implementation details",
so that we could prove two grammars equivalent,
not just in that they describe the same language,
but equivalent in output.
This could help me explore "grammar rewrites",
which could lead to usable optimizations.
I noticed that the ES spec contains this:
> ### 13.6 The if Statement
> #### Syntax
> ```
> IfStatement[Yield, Await, Return]:
> if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return] else Statement[?Yield, ?Await, ?Return]
> if ( Expression[+In, ?Yield, ?Await] ) Statement[?Yield, ?Await, ?Return]
> ```
>
> Each `else` for which the choice of associated `if` is ambiguous shall
> be associated with the nearest possible `if` that would otherwise have
> no corresponding `else`.
I wonder if this prose is effectively the same as adding a negative lookahead assertion
"[lookahead &ne; `else`]" at the end of the shorter production.
(I asked bterlson and he thinks so.)
I wonder if follow sets can be usefully considered as context-dependent.
What do I mean by this?
For example, `function` is certainly in the follow set of *Statement* in JS,
but there are plenty of contexts, like the rule `do Statement while ( Expression ) ;`,
where the nested *Statement* is never followed by `function`.
But does it matter?
I think it only matters if you're interested in better error messages.
Follow sets only matter to detect ambiguity in a grammar,
and *Statement* is ambiguous if it's ambiguous in *any* context.
### Stab 1 (very naive recursive descent)
I learned that if you simply define a grammar as a set of rules,
there are all sorts of anomalies that can come up:
* Vacant nonterminals (that do not match any input strings);
* Nonterminals that match only infinite strings, like `a ::= X a`.
* Cycles ("busy loops"), like `a ::= a`.
These always introduce ambiguity.
(You can also have cycles through multiple nonterminals:
`a ::= b; b ::= a`.)
These in particular are easy to test for, with no false positives.
I wonder if there are other anomalies,
and if the "easiness" generalizes to all of them, and why.
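For what it's worth, here's the shape of the check for the first two
anomalies together: nonterminals that never match any finite string.
The grammar representation is made up for the example:

```python
def unproductive_nonterminals(grammar):
    # `grammar` maps each nonterminal to a list of productions; a production
    # is a list of symbols, where a symbol is a terminal or a nonterminal key.
    # Iterate to a fixed point over the set of nonterminals known to match
    # at least one finite string.
    productive = set()
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            if nt in productive:
                continue
            for body in prods:
                if all(sym not in grammar or sym in productive for sym in body):
                    productive.add(nt)
                    changed = True
                    break
    return set(grammar) - productive

# `a` below matches only infinite strings, so it is reported.
grammar = {"s": [["X"], ["(", "s", ")"]], "a": [["X", "a"]]}
assert unproductive_nonterminals(grammar) == {"a"}
```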
I know what it means for a grammar to be *ambiguous*:
it means there's at least one input with multiple valid parses.
I understand that parser generators can check for ambiguity.
But it's easiest to do so by imposing draconian restrictions.
I learned the "dangling `else` problem" is an ambiguity in exactly this sense.
I wonder if there's a principled way to deal with it.
I know that a parse is a constructive proof that a string matches a grammar.
I learned that start sets are important even in minimal parser generators.
They'll be a bit more interesting to compute
once we start considering empty productions.
I wonder if it turns out to still be pretty easy.
Does the start set of a possibly-empty production include its follow set?
(According to the dragon book, you add epsilon to the start set in this case.)
### Nice grammars
I learned that the definition of a "grammar"
as a formal description of a language (= a set of strings)
is incomplete.
Consider the Lisp syntax we're using:
```
sexpr ::= SYMBOL
sexpr ::= "(" tail
tail ::= ")"
tail ::= sexpr tail
```
Nobody wants to parse Lisp like that.
There are two problems.
One is expressive.
The `"("` and `")"` tokens should appear in the same production.
That way, the grammar says declaratively: these marks always appear in properly nesting pairs.
```
sexpr ::= SYMBOL
sexpr ::= "(" list ")"
list ::= [empty]
list ::= sexpr list
```
The other problem has to do with *what you've got* when you get an automatically generated parse.
A grammar is more than just a description of a language,
to the extent we care about the form of the parse trees we get out of the parser.
A grammar is a particular way of writing a parser,
and since we care about the parser's output,
we care about details of the grammar that would be mere "implementation details" otherwise.