Altering current WisiToken data structures

We must decide how to alter the current WisiToken data structures to accommodate incremental parsing. (I'll use "we" in similar contexts, because that's the style in academic papers (it refers the reader in collaboration with the author). But there's no one else involved here, just me).

Another useful paper: Lahav 2004 - Efficient Semantic Analysis for Text Editors
That describes a "kmn-list" for representing changes to a source file. We'd eventually like to support the Language Server Protocol; that represents a text change by an affected range and the inserted text (https://microsoft.github.io/language-server-protocol/specifications/specification-current/#textDocument_didChange).

Lahav 2004 also describes a P-Tree, which is an enhanced parse tree used for incremental parsing. It is a list of partial parse trees, which can range from the list of tokens produced by the lexer, to a single fully parsed tree. Each P-Tree node is allocated from memory separately; the links between nodes are memory addresses (pointers). This minimizes the amount of node manipulation required when a node is reused during incremental parse. There is no separate structure holding the terminal tokens; the parser retrieves tokens from the P-Tree.

Wagner, Graham 1998 discusses balancing the syntax tree to make it more efficient to break down at a change point. This is needed for list structures; for example, given the productions:

statement : loop_statement | if_statement | ... ;

statement_list : statement | statement_list statement ;

The syntax tree for a sequence of 4 statements looks like:

root
    statement_list
        statement_list
            statement_list
                 statement -- 1
            statement -- 2
         statement --3
    statement -- 4

This is more a linked list than a tree, so it is closer to O(n) time to breakdown, where n is the number of nodes. To preserve O(lg (n)) time, Wagner, Graham 1998 describe a rebalancing process (similar to rebalancing a red-black tree).

Another choice would be to represent such a list as the children of a single node:

root
    statement_list
        statement -- 1
        statement -- 2
        statement -- 3
        statement -- 4

The child list could be a red-black tree instead of a linear list. This has other advantages when navigating in the tree after parsing.

WisiToken uses two data structures; a terminal token array, and a Syntax_Tree. The leaf nodes of the syntax tree contain the array index of the terminal token.

WisiToken implements a Syntax_Tree with a dynamically sized array of nodes. Links between nodes are represented by array indices of a Node_Index type. This provides fast access to arbitrary nodes, while minimizing memory allocation. However, no actual measurement of the affect of this design on parse speed was done.

One advantage of an integer Node_Index is in debugging; the node index is stable between repeated runs in a debugging session, while the memory address of an allocated node will change.

WisiToken supports generalized parsing; parallel parsers are spawned whenever an ambiguity in the grammar is encountered (represented by a conflict in the parse table). The parsers proceed in parallel, typically going to different states and reducing to different nonterminals for each terminal token from the input stream. Typically, conflicts are quickly resolved (within a few tokens); some parsers terminate on an error, the remaining parsers all reach the same state, with the same parse stack. Then only one parser continues until the next ambiguity is encountered. The WisiToken syntax tree supports branching; nodes that are shared by all parsers are in one array, while nodes specific to one parser are in a separate array. Each parser can have a different branch point; the last node in the shared array that it uses. When a new parser is spawned, a new branch is created. If the parser that encountered the conflict already had a branched tree, the new branch is a copy of the existing branch; otherwise the new branch is empty. As parsing proceeds, new nodes are created in the branched array. When the conflict is resolved, the successful parser appends its branched syntax tree node array to the shared node array, and parsing continues using only the shared array.

While parsing with a branched tree, any node in the shared node array that is modified must be copied to the branched node array. Because we use node index to determine whether the node is in the branched or shared array, all nodes between the copied node and the current last shared node must also be copied. If we set the Parent link in nodes, that is a modification. That results in most of the shared tree being copied into each branched tree, since the terminal nodes are all put into the tree first, before any parsing is done. Therefore WisiToken does not set the parent link until after parsing is done; the parent link is not needed for parsing, only for navigating the tree after parsing is done.

A parse tree implemented by allocating each node will have a similar problem; any lower-level nodes that are modified by a branched parser must be copied. Not setting the parent link would work; so far, it seems that is the only problematic node field in an incremental parser.

Changing the syntax tree data structure to individually allocated nodes requires changing the Node_Index type from an integer to a pointer. This should be transparent to most code outside the Syntax_Tree package.  However, in a few places the node index is reused as an index to an array of Booleans, to indicate that a node has already been processed in some way. If Node_Index is a pointer, that Boolean would have to be stored in the node, which means a tree traversal is required to reset it.

Eliminating the array of terminals is more problematic - the array index is of Token_Index type, and is used to indicate the ordering of the tokens in the text. In error recovery, tokens are inserted before a Token_Index, or deleted at a Token_Index. If we no longer have the token array, we no longer have Token_Index. But in incremental parsing, Token_Index would no longer indicate token ordering anyway - tokens read during an edit would be added to the end of the array. They could be inserted in the middle, but that is an O(n) operation, which we are trying to avoid. On the other hand, all tokens after the edit must have their text position adjusted, so the two operations could be combined, and traversing an array to adjust text position is faster than traversing a tree. The alternative is to use the text position of tokens to indicate ordering.

Another constraint on the data structures is the error correction algorithm (Leake 2020). It requires efficient access to the terminal token stream for push-back and experimental parse, and to the syntax tree at the error point for undo-reduce. Changing to a P-Tree structure would be ok, as long as terminal token access is sufficiently fast, which must be measured (it's not likely to be as fast as a direct array access).

Another change - non-grammar tokens, such as newline and comment, must be updated by the incremental lexer. Currently, WisiToken passes those tokens to the user function Lexer_to_Augmented and then forgets about them. Now they will have to be stored along with the grammar tokens.

Neither Wagner Graham 1998 nor Lahav 2004 discuss generalized parsing, so we have to look elsewhere for advice on data structures for generalized incremental parsers.

Tree-sitter is an existing incremental generalized parser (unfortunately implemented in C, and there are almost no comments in the code). It uses a complex tree structure, possibly optimized for memory storage. It apparently uses some sort of versioned stack for parallel parsing.

Lezer is another existing incremental parser, written in Typescript - a language I've barely heard of, but it's not hard to understand. It uses a complex tree structure, apparently optimized to reduce memory use. It uses copies of the parse stack to implement parallel parsing.

Given the above, I'll implement a new Syntax_Tree type for WisiToken, using allocated nodes, and storing all terminals only in the tree. I'll avoid any attempt at optimizing (including rebalancing) until it is working for significant use cases; then I can measure the affect of various optimization choices. One alternate design to explore is an array of nodes with a free list; that has the advantages of an integer node_index, and the advantage of less dynamic allocation. I'll use a new abstract root type for the Syntax_Tree API, so both types of tree are available; we can use the old type for batch parsing, or for generating a known good tree for testing.

Comments

Popular posts from this blog

Altering WisiToken data structures, take 2

Just getting started