Carrot

Thursday, October 1, 2009

Parser design

The Lexer used a sort of trie-lite to do its work. The parser will be driven by a full-blown trie. Here's a graphviz diagram of it:

As you can see, the tag markup structure, which looks like

[thing]

is the most overloaded, but that's Huffman coding for you :)

Anyhow, the key to understanding this diagram is knowing that everywhere you see "printable", "whitespace", or "hex", that means "any number of these characters, consecutively". The only exception is "printable-1", which is read "one printable character".

That and the unicode entity stuff look like they create some really hairy exceptions, but the truth is that due to the way the reference implementation (specifically, the Lexer) operates, it'll all come in as one wad. A simple regex, combined with the lookahead token, will tell if it's an entity or not. It's shown exploded here, but those subtrees may be optimized completely out of the live trie.
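For anyone who wants something more concrete than a diagram, here is a minimal sketch of a trie keyed on token classes rather than characters. It is not the Carrot parser: the class names 'lbracket' and 'rbracket', the result labels, and the function names are all made up for illustration, and it assumes the Lexer has already collapsed each run of printable characters into a single token.

```perl
use strict;
use warnings;

# Add one pattern (a list of token classes) to the trie, ending in $result.
sub trie_add {
    my ($trie, $result, @classes) = @_;
    my $node = $trie;
    $node = $node->{next}{$_} ||= {} for @classes;
    $node->{result} = $result;
    return;
}

# Walk the trie along a token stream. Returns the result stored at the
# final node, or undef if the stream falls off the trie.
sub trie_match {
    my ($trie, @tokens) = @_;
    my $node = $trie;
    for my $tok (@tokens) {
        $node = $node->{next}{ $tok->{class} } or return;
    }
    return $node->{result};
}

my $trie = {};
trie_add($trie, 'tag',  'lbracket', 'printable', 'rbracket');  # [thing]
trie_add($trie, 'text', 'printable');

# "printable" arrives as one token no matter how many characters it holds.
my @stream = (
    { class => 'lbracket',  text => '['     },
    { class => 'printable', text => 'thing' },
    { class => 'rbracket',  text => ']'     },
);
my $kind = trie_match($trie, @stream);
```

Because the runs are already merged by the time the trie sees them, the "any number of these characters" edges in the diagram cost one step each.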

Sunday, September 13, 2009

Message passing

I'm working on a primitive message-passing mechanism to allow objects to request that other objects execute methods on demand. Sort of a remote procedure call, you might say. The basics are as follows:

Every object which inherits from Carrot::Universal has a 'children' slot and a 'rpc_ok' slot. The first stores a reference to every object that the present object "owns", or has created. The second is a list of methods in the present object which are allowed to be called by this mechanism.
There's a 'pass' method, which creates a standard Carrot::Msg object and uses it as an argument to the 'catch' method of each object in the present object's 'children' slot.
The message created by 'pass' has a hashref instead of a string as its content. This hashref contains
1. The caller object's classname
2. The intended recipient class's name (will be matched as an ending regex, not a strict string equivalence, so 'Lexer' will work as well as Carrot::Lexer)
3. The method to be called
4. Arguments to the method
If the recipient classname matches the name of the called object, and if the given method is on the recipient's 'rpc_ok' list, then thy will be done.
Matching or not, the recipient object will pass the message on to all of its children.
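
A rough sketch of how the rules above could fit together, not the real code. The slot names 'children' and 'rpc_ok' and the method names 'pass' and 'catch' come from the description; everything else (the Sketch:: classes, 'adopt', 'set_pragma', 'wipe', the message keys, and using a plain hashref where a Carrot::Msg object would go) is assumed for the sake of the example.

```perl
use strict;
use warnings;

{
    package Sketch::Universal;   # hypothetical stand-in for Carrot::Universal

    sub new {
        my ($class, %args) = @_;
        return bless {
            children => [],   # objects this object owns
            rpc_ok   => { map { $_ => 1 } @{ $args{rpc_ok} || [] } },
        }, $class;
    }

    # Record a child object (helper invented for this sketch).
    sub adopt {
        my ($self, $child) = @_;
        push @{ $self->{children} }, $child;
        return $child;
    }

    # Build a message and hand it to every child's 'catch'.
    sub pass {
        my ($self, $to, $method, @args) = @_;
        my $msg = { from => ref $self, to => $to, method => $method, args => [@args] };
        $_->catch($msg) for @{ $self->{children} };
        return;
    }

    # Run the method if the class name ends with 'to' and the method is on
    # the rpc_ok list; forward to children either way.
    sub catch {
        my ($self, $msg) = @_;
        my ($to, $method) = @{$msg}{qw(to method)};
        if (ref($self) =~ /\Q$to\E\z/ && $self->{rpc_ok}{$method}) {
            $self->$method(@{ $msg->{args} });
        }
        $_->catch($msg) for @{ $self->{children} };
        return;
    }
}

{ package Sketch::Parser; our @ISA = ('Sketch::Universal'); }
{ package Sketch::Lexer;  our @ISA = ('Sketch::Universal'); }
{
    package Sketch::Tokenizer;
    our @ISA = ('Sketch::Universal');
    sub set_pragma { my ($self, $k, $v) = @_; $self->{pragma}{$k} = $v; return }
    sub wipe       { my ($self) = @_; $self->{pragma} = {}; return }   # not on rpc_ok
}

# Parser owns Lexer owns Tokenizer, so the Parser is two levels away.
my $parser    = Sketch::Parser->new;
my $lexer     = $parser->adopt(Sketch::Lexer->new);
my $tokenizer = $lexer->adopt(Sketch::Tokenizer->new(rpc_ok => ['set_pragma']));

$parser->pass('Tokenizer', 'set_pragma', 'tabwidth', 4);   # allowed, runs
$parser->pass('Tokenizer', 'wipe');                        # not on rpc_ok, ignored
```

The Lexer in the middle never has to know what 'set_pragma' is; it just forwards the message.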

Why? This is the mechanism that lets (definitely some, possibly all of) the pragmas work. In the new design, the level at which pragmas are recognized (the Parser) is, for instance, 2 levels removed from where markup characters are tokenized as such (the Tokenizer). The possible solutions were

to implement a custom pass-along method for EVERY method that ends up being at greater than one remove from where it ends up being needed
to pass handles to objects to all other objects (AKA "the last design, to an extent")
to implement a generic message passing mechanism

The last one seems cleanest, most maintainable, and most useful in the long-term to me.

Wednesday, September 9, 2009

The Way Forward

I don't mean how to complete Carrot. Though that is distressingly far from a fait accompli, I have somehow managed to convince myself that I'm capable of cracking all the remaining problems -- if by no other mechanism, then by letting it be ugly where it must in order to be done.

What I've been thinking of lately is the larger problem; the one that comes after Carrot; the one which Carrot is a gateway to solving. The archival and publishing problem. I've been thinking on it for years, and today I got a flash of insight which threatens to make the whole thing far, far simpler.

Every chunk gets a UUID.

This provides a system/implementation independent way to point at any piece of content in any document. It does explicitly make the chunk the smallest addressable portion, but it was already that way implicitly.

Another possibility: each chunk gets a SHA1 hash as well, enabling version-to-version delta compression at the chunk level. This way, you don't have to care where chunks might get moved to in a doc restructuring; so long as the hash is the same, the content is the same, and it can be mapped to transparently.
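
A minimal sketch of the hash idea, assuming a chunk is just a hashref with an 'id' and some 'text'. The ids below are placeholder strings standing in for real UUIDs, and the function names and report layout are made up; only the use of SHA1 comes from the idea above. Digest::SHA ships with Perl, and the text is assumed to be bytes (encode it first if it holds wide characters).

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

# Index a version's chunks by the SHA1 of their content.
sub index_by_hash {
    my (@chunks) = @_;
    my %by_hash;
    for my $pos (0 .. $#chunks) {
        my $digest = sha1_hex($chunks[$pos]{text});
        push @{ $by_hash{$digest} }, { id => $chunks[$pos]{id}, pos => $pos };
    }
    return \%by_hash;
}

# For each chunk in the new version, find where the same content sat in
# the old version. 'from' is undef when the content is new.
sub map_versions {
    my ($old, $new) = @_;
    my $old_index = index_by_hash(@$old);
    my @report;
    for my $pos (0 .. $#$new) {
        my $digest = sha1_hex($new->[$pos]{text});
        my $hit    = $old_index->{$digest} ? $old_index->{$digest}[0] : undef;
        push @report, {
            id   => $new->[$pos]{id},
            pos  => $pos,
            from => $hit ? $hit->{pos} : undef,
        };
    }
    return \@report;
}

my @v1 = (
    { id => 'chunk-a', text => 'Introduction.' },
    { id => 'chunk-b', text => 'Background.'   },
);
my @v2 = (
    { id => 'chunk-b', text => 'Background.'   },   # moved up, unchanged
    { id => 'chunk-a', text => 'Introduction.' },   # moved down, unchanged
    { id => 'chunk-c', text => 'Conclusion.'   },   # new content
);
my $report = map_versions(\@v1, \@v2);
```

Only the chunk whose 'from' comes back undef would need to be stored for the new version; the moved ones map straight onto what is already there.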

Monday, December 22, 2008

Parallel Development

Now that I'm using git, I have powerful new options available for working on projects, like spawning as many single-feature development branches as I feel like. Today I spun one off and started overhauling the Carrot status system.

The first problem was that there were a lot of statuses, and while referring to them by numeric id sounds cool (it works for Apache), it's a little less than ideal when you're reusing ids between modules. I wrote the damn thing, and I was constantly having to refer to a wiki page to tell me what statuses meant. This has been (partly) fixed by moving the lists of statuses into the modules themselves, in a __DATA__ segment which populates a hashref at constructor time.

The other part of the fix is that statuses are no longer set by passing the 'status' method a number and having that be that. This had all the problems from the above paragraph, plus a complete lack of validity checking. Now statuses are set by passing the short "name" value which acts as a key on the statuses hashref. This makes status changes much more clear (status('EBADPERMS') vs. status(103)), and now passing an undefined name to status causes an abend.

A side-effect of this is that the code is now self-documenting with respect to its statuses, and the list of statuses can't get out-of-sync without the code exploding.

Lastly, the notion of a status being set has been divorced from the notion of something having gone wrong. Several of the Carrot modules now have non-error statuses beyond '0' for "still running" and '1' for "all done" (and, as a side note, '1' is no longer consistent for that -- this needs cleanup and '99' should be the new standard for "ops complete"). Since error statuses start at code 100, the status method now sets an error flag when a code above 99 is set. Soon, all modules will be updated to halt and return undef on error instead of on status.
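
A minimal sketch of the new status scheme, not the actual Carrot code. The real modules read their table from a __DATA__ segment; a heredoc stands in for it here so the snippet can live in one piece. Only 'EBADPERMS' / 103, the "above 99 is an error" rule, and '99' for "ops complete" come from the post. The other names, the table layout, the Sketch::Status class, the 'error' accessor, and returning the current status when called with no argument are assumptions.

```perl
use strict;
use warnings;

{
    package Sketch::Status;

    # Stand-in for the module's __DATA__ segment: name, code, description.
    my $TABLE = <<'END_STATUSES';
RUNNING    0    still running
DONE       99   ops complete
EBADPERMS  103  insufficient permissions
END_STATUSES

    # Populate the statuses hashref at constructor time.
    sub new {
        my ($class) = @_;
        my %statuses;
        for my $line (split /\n/, $TABLE) {
            my ($name, $code, $text) = split ' ', $line, 3;
            $statuses{$name} = { code => $code, text => $text };
        }
        return bless { statuses => \%statuses, status => 'RUNNING', error => 0 }, $class;
    }

    # Set the status by name; die on a name that is not in the table.
    sub status {
        my ($self, $name) = @_;
        return $self->{status} unless defined $name;
        my $entry = $self->{statuses}{$name}
            or die "unknown status '$name'\n";
        $self->{status} = $name;
        $self->{error}  = $entry->{code} > 99 ? 1 : 0;   # error codes start at 100
        return $name;
    }

    sub error { return $_[0]{error} }
}

my $job = Sketch::Status->new;
$job->status('EBADPERMS');                 # sets the error flag
my $ok = eval { $job->status('ENOSUCH'); 1 };   # dies, so $ok stays undef
```

A typo in a status name blows up at the call site instead of quietly setting the wrong number.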

EDIT: ONE YEAR ANNIVERSARY OF CARROT BLOG AND THE RESTART OF SERIOUS DEVELOPMENT

(...so far yet to go...)

Sunday, December 14, 2008

DOM Update 1

'nukenode' is no more. This is good. If a call to 'add' fails, that's really a fatal error -- there is no reason to cruft up the module with code which sloppily backs out the partial changes. I'm still hoping to factor away the partial changes bit and get back to having adds be atomic with respect to the datastore.

Saturday, December 13, 2008

Movement

First off, I'm now using git for local revision control.

Secondly, I've spent some time recently thinking about what the DOM should do -- as opposed to thinking about what's wrong with it right now -- and how to most efficiently make those things happen. I believe the result will be much slimmer code which is much more capable.

The DOM as it exists now has 2 fundamental problems. First, as noted previously, it was written over two years in advance of the Grammarian, so it's filled with bad guesses about how things would eventually work out. Second, I wrote it while having a fit of needing to prove to myself that I could write something like this, so it's full of questionable and/or mildly baroque design choices.

It'll be better soon.

Sunday, November 2, 2008

Speaking of refactoring

The stuff in the last post is all AFTER fixing the hash/hashref argument problem.

Also, Handler/Commenter.pm is no more. Its code and a genericized status() have been folded into Util.pm.