Posts

Starting on incremental parser

Edit_Tree is now implemented; it takes an existing syntax tree (which is the result of a batch parse), and applies a list of edits described by a KMN list as in Lahav 2008. The result is a parse stream consisting of a mix of terminals and subtrees. The next step is to implement the incremental parser. Wagner, Graham 1998 gives a better algorithm for this than Lahav 2008; in fact, it incorporates Edit_Tree, although it does not give as much detail. As a first step, I'll keep Edit_Tree separate.  WisiToken provides a generalize parser; it handles conflicts in the parse table by spawning parallel parsers. Wagner, Graham 1998 does not include that; it assumes a single parser. So for the first step, it will abort on conflicts; the first test cases will not encounter grammar conflicts. It will also abort for all syntax errors; I'll add error recovery later. The algorithm in Lahav 2008 is significantly different from Wagner, Graham 1998; Lahav's Left_Breakdown changes the tree str...

Update the Ada parser

The WisiToken error recovering parser was not hard to update to use the new syntax tree; once that was working, I started updating the Ada parser to use it. I took a detour here; Ada 2020 is almost out, so I added the new syntax to the Ada parser. Along the way I found a better way to implement the indent engine. That's all working now; the ada-mode tests all pass. So back to incremental parse. Lahav 2004 describes an "incremental scanner"; only the tokens that actually require scanning due to text changes are scanned. The other tokens following any change are adjusted to reflect the change in position. WisiToken uses the term "lexer" instead of "scanner"; it uses the re2c lexer. The lexing process must modify the existing terminal stream in the syntax tree, rather than creating a new one; so it makes sense to also modify the parse tree at the same time, breaking it down where it must change.

Altering WisiToken data structures, take 2

Found another paper; Overbey 2004 - Practical, Incremental, Noncanonical Parsing: Celentano's Method and the Generalized Piecewise LR Parsing Algorithm. That at least discusses incremental generalized parsing, but it gives no details on the syntax tree structure. Implementing code is also more educational than thinking about it. It proved impractical to keep both the array-based and allocation-based syntax trees, so I just changed it to allocation-based. Then I ran into the problem of finalizing the syntax tree; many nodes are referenced several times, in several different parse streams, so we can't traverse each stream to deallocate nodes. That also means we can't delete the nodes created by a stream when the stream terminates due to a parse error. I ended up adding an array of node access, and using that to deallocate all the nodes when finalizing the tree. In order to delete nodes from terminated streams, we wait until the parse is complete, then copy the final tree from...

Bibliography

I'll maintain a bibliography list in this post (alphabetical by author or web site name): papers/books: Lahav 2004 - Efficient Semantic Analysis for Text Editors     Describes some implementation details for a parser based on Wagner, Graham 1998 Leake 2020 - Robust error correction in a generalized LR parser     Describes the error correction algorithm used in WisiToken Overbey 2004 - Practical, Incremental, Noncanonical Parsing: Celentano's Method and the Generalized Piecewise LR Parsing Algorithm      No details on data structures. Wagner, Graham 1998 - Efficient and Flexible Incremental Parsing     T horough discussion of incremental parsing, but does not discuss generalized parsing. Web sites: Lezer  - incremental generalized parser/generator, written in Typescript Tree-sitter - incremental generalized parser/generator; generator written in Rust, parser in C

Daydreams about incremental vs partial parse

WisiToken currently supports "partial-parse"; Emacs passes only part of a file to the parser, in order to speed up the response. Emacs ada-mode uses regular expressions to find the start and end point of the parse region. This works fairly well, but not as well as I'd like, which is the motivation behind implementing incremental parse. Incremental parse is initially intended to replace partial-parse, but it might be useful to keep both. "Standard" incremental parse requires a full parse when the file is first opened; that will be my initial approach. Some of my customers report that the most common use case is navigating in files, to understand the code; navigation typically requires a full file parse anyway. I've daydreamed about relaxing that; using the current partial parse initially, and extending the parsed region on demand. There could be several disjoint parsed regions in a file (that's supported now by the wisi Gnu ELPA package; it does not requi...

Altering current WisiToken data structures

We must decide how to alter the current WisiToken data structures to accommodate incremental parsing. (I'll use "we" in similar contexts, because that's the style in academic papers (it refers the reader in collaboration with the author). But there's no one else involved here, just me). Another useful paper: Lahav 2004 - Efficient Semantic Analysis for Text Editors That describes a "kmn-list" for representing changes to a source file. We'd eventually like to support the Language Server Protocol; that represents a text change by an affected range and the inserted text (https://microsoft.github.io/language-server-protocol/specifications/specification-current/#textDocument_didChange). Lahav 2004 also describes a P-Tree, which is an enhanced parse tree used for incremental parsing. It is a list of partial parse trees, which can range from the list of tokens produced by the lexer, to a single fully parsed tree. Each P-Tree node is allocated from memory se...

Just getting started

I'm just getting started adding incremental parse to WisiToken , so I decided to try writing a blog about the process. Maybe people will be interested in this experience. First step; literature search, mostly in Google scholar. The best paper seems to be Wagner, Graham 1998 Efficient and Flexible Incremental Parsing . There are many papers that cite that one, but none seems to improve on it much. Next step: add a test to the WisiToken test set. It won't be very elaborate; just a string to lex and parse, an edit to the string, and expected results - a token sequence and some representation of the syntax tree. I'll add cases as the code progresses.