Thursday, October 1, 2009

Parser design

The Lexer used a sort of trie-lite to do its work. The parser will be driven by a full-blown trie. Here's a graphviz diagram of it:


As you can see, the tag markup structure, which looks like
[thing]
is the most overloaded, but that's Huffman coding for you :)

Anyhow, the key to understanding this diagram is knowing that everywhere you see "printable", "whitespace", or "hex", that means "any number of these characters, consecutively". The only exception is "printable-1", which is read "one printable character".

That and the unicode entity stuff look like they create some really hairy exceptions, but the truth is that due to the way the reference implementation (spefically, the Lexer) operates, it'll all come in as one wad. A simple regex, combined with the lookahead token, will tell if it's a entity or not. It's shown exploded here, but those subtrees may be optimized completely out of the live trie.

No comments: