Altering WisiToken data structures, take 2
Found another paper; Overbey 2004 - Practical, Incremental, Noncanonical Parsing: Celentano's Method and the Generalized Piecewise LR Parsing Algorithm. That at least discusses incremental generalized parsing, but it gives no details on the syntax tree structure.
Implementing code is also more educational than thinking about it. It proved impractical to keep both the array-based and allocation-based syntax trees, so I just changed it to allocation-based. Then I ran into the problem of finalizing the syntax tree; many nodes are referenced several times, in several different parse streams, so we can't traverse each stream to deallocate nodes. That also means we can't delete the nodes created by a stream when the stream terminates due to a parse error. I ended up adding an array of node access, and using that to deallocate all the nodes when finalizing the tree. In order to delete nodes from terminated streams, we wait until the parse is complete, then copy the final tree from its root into a new tree, and finalize the old one.
The array of nodes for finalization also provides a Node_Index type, useful for debug output. With the array-based syntax tree, we also used Node_Index to create an array of Boolean in a few places, to track whether a node had been translated from EBNF to BNF, or an error for that node had already been reported. With the allocation and stream based syntax tree, Node_Index values become sparse, so an array of boolean is inefficient; we now use a set implemented by a red-black tree.
The simplest way to add generalized parsing to the parse-tree described in Lahav 2004 is to give each parser a complete copy of the stream, since it consumes the terminal nodes when constructing the nonterminals. But that is horribly inefficient; a parser is most likely terminated after parsing just a few tokens. So I added a layer of indirection; Stream_Elements hold pointers to tree nodes, and streams are a doubly-linked list of Stream_Elements. There is one permanent stream holding all the terminals, and each other stream copies the node pointers from those when creating nonterminals.
To create a new stream, we copy the entire stream before the parse point, but not the terminals after the parse point. The tree node pointers are copied, no new tree nodes are created. This will have to change when we finally get to incremental parsing; the stream after the parse point will contain a mix of nonterminals and terminals.
Keeping the terminal stream will facilitate using this structure during error recovery, for push-back. At the moment I'm working on getting the non-error correcting batch-mode parser to work with the new syntax tree structure.
Comments
Post a Comment