Thursday, October 1, 2009

Parser design

The Lexer used a sort of trie-lite to do its work. The parser will be driven by a full-blown trie. Here's a graphviz diagram of it:


As you can see, the tag markup structure, which looks like
[thing]
is the most overloaded, but that's Huffman coding for you :)

Anyhow, the key to understanding this diagram is knowing that everywhere you see "printable", "whitespace", or "hex", that means "any number of these characters, consecutively". The only exception is "printable-1", which is read "one printable character".

That and the unicode entity stuff look like they create some really hairy exceptions, but the truth is that due to the way the reference implementation (spefically, the Lexer) operates, it'll all come in as one wad. A simple regex, combined with the lookahead token, will tell if it's a entity or not. It's shown exploded here, but those subtrees may be optimized completely out of the live trie.

Sunday, September 13, 2009

Message passing

I'm working on a primitive message-passing mechanism to allow objects to request that other objects execute methods on demand. Sort of a remote procedure call, you might say. The basics are as follows:
  1. Every object which inherits from Carrot::Universal has a 'children' slot and a 'rpc_ok' slot. The first stores a reference to every object that the present object "owns", or has created. The second is a list of methods in the present object which are allowed to be called by this mechanism.
  2. There's a 'pass' method, which creates a standard Carrot::Msg object and uses it as an argument to the 'catch' method of each object in the present object's 'children' slot.
  3. The message created by 'pass' has a hashref instead of a string as its content. This hashref contains
    1. The caller object's classname
    2. The intended recipient class's name (will be matched as an ending regex, not a strict string equivalence, so 'Lexer' will work as well as Carrot::Lexer)
    3. The method to be called
    4. Arguments to the method
  4. If the recipient classname matches the name of the called object, and if the given method is on the recipient's 'rpc_ok' list, then thy will be done.
  5. Matching or not, the recipient object will pass the message on to all of its children.
Why? This is the mechanism that lets (definitely some, possible all of) the pragmas work. In the new design, the level at which pragmas are recognized (the Parser) is, for instance, 2 levels removed from where markup characters are tokenized as such (the Tokenizer). The possible solutions were
  1. to implement a custom pass-along method for EVERY method that ends up being at greater than one remove from where it ends up being needed
  2. to pass handles to objects to all other objects (AKA "the last design, to an extent")
  3. to implement a generic message passing mechanism
The last one seems cleanest, most maintainable, and most useful in the long-term to me.

Wednesday, September 9, 2009

The Way Forward

I don't mean how to complete Carrot. Though that is distressingly far from a fait accompli, I have somehow managed to convince myself that I'm capable of cracking all the remaining problems -- if by no other mechanism, then my letting it be ugly where it must in order to be done.

What I've been thinking of lately is the larger problem; the one that comes after Carrot; the one which Carrot is a gateway to solving. The archival and publishing problem. I've been thinking on it for years, and today I got a flash of insight which threatens to make the whole thing far, far simpler.

Every chunk gets a UUID.

This provides a system/implementation independent way to point at any piece of content in any document. It does explicitly make the chunk the smallest addressible portion, but it was already that way implicitly.

Another possibility: each chunk gets a SHA1 hash as well, enabling version-to-version delta compression at the chunk level. This way, you don't have to care where chunks might get moved to in a doc restructuring; so long as the hash is the same, the content is the same, and it can be mapped to transparently.