Saturday, March 15, 2008

How Carrot Works -- I: Overview

Welcome to the first article in an ongoing series, in which I attempt to actually explain the stuff that I usually blather on about without backreference or elucidation of any sort.

This is a high-level overview of the extant design for the Carrot processing flow. For a basic overview of the language itself, see the 10km document.

The Grammarian
Compilation of a Carrot document is a two-phase process. In the first phase, a subsystem called the Grammarian examines the document for well-formedness; that is, it makes sure that the document is written in a grammar of Carrot. The Grammarian proper is a message-passer/traffic cop among a small flock of components which handle the majority of the actual work of this phase of processing.

The Lexer does all input filehandling and decomposes the resultant stream of text into tokens. There are only three kinds of tokens: whitespace strings, nonwhitespace strings, and newlines.
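
To make the three token kinds concrete, here is a minimal sketch of such a lexer. It is written in Python purely for illustration (Carrot itself lives in Perl modules), it skips the filehandling and works on a string, and the names Token and lex are my own assumptions, not the real Lexer's interface.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # "newline", "whitespace", or "word" (hypothetical labels)
    text: str


# Newlines are tried first so they never get swallowed by the
# whitespace alternative; everything else is whitespace or non-whitespace.
_TOKEN_RE = re.compile(
    r"(?P<newline>\r\n|\n|\r)"
    r"|(?P<whitespace>[^\S\r\n]+)"
    r"|(?P<word>\S+)"
)


def lex(source: str) -> Iterator[Token]:
    """Split source text into newline, whitespace, and non-whitespace tokens."""
    for match in _TOKEN_RE.finditer(source):
        # lastgroup is the name of the alternative that matched.
        yield Token(match.lastgroup, match.group())
```

Because every character falls into exactly one of the three classes, joining the token texts back together reproduces the input, which is handy when a later stage needs to report errors against the original source.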

Tokens from the lexer are examined by a group of Handler components. There is one handler for each type of Carrot markup, and one each for special cases like the document and include triggers (the pragma). These handlers construct nodes from the tokens, with each node being a lexical chunk of the document and all available metadata for it.

These nodes are then passed to the DOM (Document Object Model) component, which does the bulk of well-formedness checking and then stows properly constructed nodes into the document tree (which the DOM also provides read access to).

Assuming all goes well, the entire source document will be transformed into a datastructure which encodes the content and structure of the original, along with explicit and inferred metadata for later processing and (possible) error reporting.

The Ontologist
Phase 2 begins when another subsystem, called the Ontologist, starts a directed walk of the DOM's tree. It is the Ontologist's job to make sure that the document is written in a valid dialect of Carrot. This is necessary because, in its purest form, Carrot is, like XML, a language construction kit.

In any event, the Ontologist examines each node in isolation (This node says it's a trigger and its element type is "foobar". Is there a foobar trigger in my current dialect? This node contains an attribute "baz". Is that attribute valid for this element, and is its value of the defined type and within any defined bounds?) and in relation to other nodes (Is this element allowed within the current scope(s)?).
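
The questions above can be sketched as a small recursive check. This is an illustration in Python, not the real Ontologist: the Node shape, the DIALECT table, and the validate function are all assumptions of mine, and the hard-coded dictionary merely stands in for rules that would really be written in Radish.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str      # e.g. "trigger"
    element: str   # e.g. "foobar"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


# Hypothetical dialect table. Each attribute maps to (type, min, max);
# "allowed_in" lists the parent elements (None means top level).
DIALECT = {
    ("trigger", "document"): {"attributes": {}, "allowed_in": {None}},
    ("trigger", "foobar"): {
        "attributes": {"baz": (int, 0, 10)},
        "allowed_in": {"document"},
    },
}


def validate(node, scope=None, errors=None):
    """Walk the tree from node, collecting validity errors as strings."""
    errors = [] if errors is None else errors
    rule = DIALECT.get((node.kind, node.element))
    if rule is None:  # is there a foobar trigger in my dialect?
        errors.append(f"unknown {node.kind} '{node.element}'")
        return errors
    if scope not in rule["allowed_in"]:  # allowed within the current scope?
        errors.append(f"'{node.element}' not allowed in scope {scope!r}")
    for name, value in node.attributes.items():
        spec = rule["attributes"].get(name)
        if spec is None:  # is that attribute valid for this element?
            errors.append(f"'{node.element}' has no attribute '{name}'")
            continue
        expected_type, low, high = spec
        if not isinstance(value, expected_type):
            errors.append(f"'{name}' must be of type {expected_type.__name__}")
        elif not low <= value <= high:
            errors.append(f"'{name}' must be between {low} and {high}")
    for child in node.children:
        validate(child, node.element, errors)
    return errors
```

An empty list back from validate means the walk found nothing to complain about; collecting errors rather than stopping at the first one is a choice made for the sketch, not a statement about how Carrot reports problems.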

The Ontologist's validity rules are expressed in a Carrot grammar called Radish, which will be the topic of an entire post later on. (Yes, Carrot bootstraps itself, using itself.)

If the Ontologist successfully walks the entire tree, then the document is both well-formed and valid, and compilation is complete. From this point on, it's about what the user wants to do with the document. And that'll be the topic of the next How Carrot Works post.

Friday, March 14, 2008

Generic Handlers complete. Next up: Pragma

The regular Handlers are complete and tested (though not (quite) yet tested to 100% code coverage). Text.pm is 36 lines, according to sloccount. Maybe I'm finally getting good at estimating things relating to code.

In my next update, to be written when I'm more awake, I shall talk at some length about exactly what the hell is going on here.

I'm seeing weirdness with Devel::Cover and subclasses. It refuses to believe that I'm testing methods which subclasses redefine. This is very unhappy-making :(

Tuesday, March 11, 2008

Shock and Awe

When I started work on the handlers, I figured I could get maybe three-quarters of the code for Triggers and Verbatims shared through the generic Handler, and maybe half of what Tags would need, and perhaps one-quarter of Text chunks.

As it turns out, with not much effort, and very little in the way of redundancy or special-casing, 100% of the code for Triggers, Verbatims, and Tags is shared. Their modules are bare subclasses of the Handler superclass. The empty modules are still necessary, though, because of the handful of places where the shared code does behave differently based on what class it's executing as (and it leaves a clean, clear maintenance path for when I start finding the things I've missed and the second, customized parsing stage becomes necessary).
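
For anyone who hasn't seen the bare-subclass trick before, here is a rough sketch of the shape. It is in Python rather than the Perl the handlers are actually written in, and the method names node_type and build_node are invented for the example; the real Handler's interface is not shown here.

```python
class Handler:
    """Generic handler: all of the shared logic lives in this class."""

    def node_type(self):
        # One of the places where shared code behaves differently
        # depending on which class it is executing as.
        return type(self).__name__.lower()

    def build_node(self, tokens):
        # Hypothetical node shape: a type tag plus the joined token text.
        return {"type": self.node_type(), "text": "".join(tokens)}


# Bare subclasses: no code of their own, but each gives the shared
# code a distinct identity and a place to put overrides later.
class Trigger(Handler):
    pass


class Verbatim(Handler):
    pass


class Tag(Handler):
    pass
```

Calling Trigger().build_node(["a", " ", "b"]) goes through exactly the same code as Tag().build_node(...), and only the type recorded in the node differs.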

So all my careful work in designing that two-stage parsing pass was either wasted or wildly successful, depending on your point of view :)

The Text handler does actually get rewritten methods -- two of them -- and one unique to itself, but it's far simpler than the other handlers, so the loccount for these additions will still be under 50, I think.

All the handlers except Text are testing clean right now as well. Handler.pm's coverage is up to 91% overall. In a scarily short time, it will be time to lash the handlers to the Grammarian (which is really to say: to the DOM).

Friday, March 7, 2008

Subclasses ahoy

Since my last post, the generic Handler has been tested about as far as it can be in a standalone mode, and coding has begun on the specific subclass handlers (Trigger, Tag, Verbatim, Text, Document).

I'm not entirely sure yet, but I think the Use and Import pragmas will be implemented as subclassed handlers, acting in the same way as the Document trigger.

At the moment, things are moving quickly. I'm not sure how long this will last.