Friday, September 26, 2008

Grammarian First Light

At 02:14, 27 September 2008, Grammarian.pm successfully parsed the document pragma trigger of this document:

;;; hello, world
;;;
;;; simplest correct document

[[document title;'hello.carrot']]

hello, world! ;;; document plaintext

With this, 3 years of effort comes to fruition, and Carrot takes a giant step towards being real. Here's what happened:
  1. A Grammarian object is instantiated
  2. A Lexer object is instantiated within the Grammarian
  3. The Grammarian, controlling the Lexer, reads through the document until something that isn't a comment or whitespace comes along
  4. A Document pragma handler is instantiated
  5. The Document handler is asked if it handles the token the Grammarian is looking at
  6. If not, the document is nonconformant, a status is set, and processing terminates here
  7. Else, the Document handler is passed control of the Lexer
  8. If all goes well, the Document handler parses the contents of the document trigger into a datastructure, which is returned to the Grammarian
  9. The Grammarian then instantiates a DOM object, giving it the results of the Document handler's work as the root node of the new document
  10. I do a little dance because it works It Works IT WORKS!!
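The steps above can be sketched in miniature. This is only an illustration: the Sketch::* packages, their method names, the status value 1, and the hash standing in for the DOM are all made up for this example and are not the real Firepear::Carrot code. The sketch also hands the handler a single token rather than control of the Lexer.

```perl
use strict;
use warnings;

# Hypothetical stand-ins; none of these are the real Firepear::Carrot classes.
package Sketch::Lexer {
    sub new {
        my ($class, $text) = @_;
        return bless { lines => [ split /\n/, $text ] }, $class;
    }
    # Return the next line that is neither blank nor a ';;;' comment.
    sub next_significant {
        my ($self) = @_;
        while (defined(my $line = shift @{ $self->{lines} })) {
            next if $line =~ /^\s*$/ or $line =~ /^\s*;;;/;
            return $line;
        }
        return;
    }
}

package Sketch::Pragma::Document {
    sub new { return bless {}, shift }
    # Does this handler claim the token?
    sub handles {
        my ($self, $token) = @_;
        return $token =~ /^\[\[document\b/ ? 1 : 0;
    }
    # Parse "[[document key;'value']]" into a simple data structure.
    sub parse {
        my ($self, $token) = @_;
        my %attrs = $token =~ /(\w+);'([^']*)'/g;
        return { type => 'document', attrs => \%attrs };
    }
}

package Sketch::Grammarian {
    sub new {
        my ($class, $text) = @_;
        return bless { lexer => Sketch::Lexer->new($text), status => 0 }, $class;
    }
    sub parse {
        my ($self) = @_;
        my $token   = $self->{lexer}->next_significant;   # step 3
        my $handler = Sketch::Pragma::Document->new;      # step 4
        unless (defined $token and $handler->handles($token)) {
            $self->{status} = 1;   # placeholder "nonconformant" code (step 6)
            return;
        }
        my $root = $handler->parse($token);               # steps 7-8
        return { root => $root };                         # stand-in DOM (step 9)
    }
}

package main;

my $doc = join "\n",
    ";;; hello, world",
    ";;;",
    ";;; simplest correct document",
    "",
    "[[document title;'hello.carrot']]",
    "",
    "hello, world! ;;; document plaintext";

my $dom = Sketch::Grammarian->new($doc)->parse;
print $dom->{root}{attrs}{title}, "\n";
```

Run against the hello-world document, the sketch skips the three comment lines and the blank line, hands the trigger line to the Document stand-in, and ends up with a root node whose title attribute is hello.carrot.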

Tomorrow I'll be writing (lots) more tests, and I hope to have a demo online by Wednesday, 1 October.

Tuesday, September 16, 2008

Grammarian Begin!

There's real code in there now. Not much (just enough to generate comically bad coverage numbers), but it's a start. With the Handler tests' minigrammarian routine already proving the basic idea, lashing that together in a production sort of way should be pretty easy.

The worry for me is hooking it up to the DOM, which I haven't touched in quite a while now. My irrational mind fears there has been some drift in document node format between last-working-on-DOM and writing-flock-of-Handlers. My logical mind thinks "not so much", but even if it's wrong, that's a fairly minor issue.

A much larger change would be the thought I had the other night: that the DOM should be split into a storage access layer and the actual DOM accessor/traversal methods. This feels right, but it could well be premature optimization.

It'd be a good way to get the tests for it rewritten, tho :)

Sunday, September 14, 2008

300

Changeset 300 was just committed to the Carrot repo. It was a minor change to Handler.pm with decently large consequences. Following on from the last post, I was worried about another untested case: immediately nested tags, like this...

[foo [bar a;b blah blah]]

I was worried this might turn up an edge case in valid_ending_pos -- and it did, but not in the manner I had anticipated. I thought the immediate nesting, with no intervening attributes or content, might break something. That part worked just fine, but the cuddled terminators weren't behaving properly at all.

What should happen is:
  1. The Text handler sees ']]', quits processing, and returns
  2. The Tag handler is called, sees ']]', which begins with its markup terminator, and claims responsibility
  3. Tag consumes its terminating markup (']'), and pushes the remainder (also ']') back on to the Lexer's next-token stack
  4. Tag is called again and the final ] is consumed
Instead, the first call to Tag was seeing ']]' and seemingly consuming both brackets. But only seemingly; what actually happened requires a little more granularity:
  1. Text sees ']]', pushes it onto the Lexer's next-token stack, quits processing, and returns
  2. Lexer, knowing that it is at physical EOF, sets its status to 1 (EOF, but I have synthetic tokens in stack)
  3. Tag is called, consuming a token from the stack, which happens to be the only token in the stack.
  4. Lexer gives Tag the token and sets its status to 99 (EOF, no synthetic tokens)
  5. Tag calls valid_ending_pos which has the statement 'return 1 if $l->status == 99' right up top. This is meant to let Triggers and Verbatim contexts always have a valid ending at EOF.
  6. Lexer's status is 99
  7. Tag does not execute the code at the bottom of valid_ending_pos, which performs the separation of terminating markup from the current token and the subsequent injection of the remainder onto Lexer's next-token stack (which would have reset its status to 1), allowing for cuddled terminators (and sloppiness like 'foo]bar').
  8. The current token, containing the terminators for both open tags, is thrown away
  9. The test suite calls Tag again to process the second terminator.
  10. Lexer returns undef; tests fail
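The failure is easy to reproduce in a toy model. In the sketch below, everything except the status values 1 and 99 and the early 'return 1 if $l->status == 99' is hypothetical: Sketch::StackLexer, split_terminator, and count_closes are invented for illustration, and the second ordering is just one way of avoiding the problem, not necessarily the change that went into Handler.pm.

```perl
use strict;
use warnings;

# Hypothetical miniature Lexer: only the next-token stack and the two
# EOF status codes described above are modelled.
package Sketch::StackLexer {
    sub new    { return bless { stack => [], status => 99 }, shift }
    sub status { return $_[0]{status} }
    sub push_token {
        my ($self, $tok) = @_;
        push @{ $self->{stack} }, $tok;
        $self->{status} = 1;                              # EOF, synthetic tokens waiting
    }
    sub next_token {
        my ($self) = @_;
        my $tok = pop @{ $self->{stack} };
        $self->{status} = 99 unless @{ $self->{stack} };  # EOF, stack empty
        return $tok;
    }
}

package main;

# If the token starts with the terminator, push whatever follows it back
# onto the lexer's stack and report success.
sub split_terminator {
    my ($l, $tok, $term) = @_;
    return 0 unless index($tok, $term) == 0;
    my $rest = substr($tok, length($term));
    $l->push_token($rest) if length($rest);
    return 1;
}

# EOF check first: the remainder of ']]' is never pushed back.
sub ending_eof_first {
    my ($l, $tok, $term) = @_;
    return 1 if $l->status == 99;
    return split_terminator($l, $tok, $term);
}

# Split first, then fall back to the always-valid-at-EOF rule.
sub ending_split_first {
    my ($l, $tok, $term) = @_;
    return 1 if split_terminator($l, $tok, $term);
    return $l->status == 99 ? 1 : 0;
}

# Count how many closes a Tag-like caller gets before the lexer runs dry.
sub count_closes {
    my ($check) = @_;
    my $l = Sketch::StackLexer->new;
    $l->push_token(']]');
    my $closes = 0;
    while (defined(my $tok = $l->next_token)) {
        $closes++ if $check->($l, $tok, ']');
    }
    return $closes;
}

print "EOF check first: ", count_closes(\&ending_eof_first), " close(s)\n";
print "split first:     ", count_closes(\&ending_split_first), " close(s)\n";
```

With the EOF check first, the lone ']]' token closes one tag and the lexer is then empty, which matches the failing test. With the split done first, the leftover ']' goes back on the stack and a second close is seen.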
Any sufficiently complex system will always be able to bite you in the ass. Testing is your only defense. So write more tests, today! Buy war bonds! Victory!

Stopper?

Some months in the past, back in April-June when I was dealing with the foo]] problem, I made a note to myself:
Handler valid_ending_pos woefully inadequate for Tags
But by the time I got around to reading my note, I couldn't fathom what the problem was. Clearly past-me had seen something, but present-me didn't see anything.

Today-me found it. On the Grammarian status code list, there is only one code specific to Tags. In section 3000:

400 Trigger or verbatim found in tag content

There is no testing for status 3400 (entry 400 of section 3000) anywhere in the current suite. There is also no code which sets status 3400. Consider the following.

Lorem ipsum [foo bar;baz [[quux dolor sit amet, consectetuer
adipiscing]] elit.] Sed massa mi, dapibus.

This is doubly illegal. It's forbidden by the grammar rule embodied in status 3400 as well as the rule which says that the opening markup of triggers must be the first thing on a line. Well, the trigger-specific infringement will be caught (status 3200, as it happens), but the 3400 violation isn't caught by any code at all. And what about...

Lorem ipsum [foo bar;baz
[[quux dolor sit amet, consectetuer
adipiscing]]
elit.] Sed massa mi, dapibus.

There. Now the 3200 violation goes away (this code is well-formed) AND the 3400 violation isn't caught. Problem! Isn't it?

That depends on whose job it is to know what's inside of what. This specific example could be caught by Tag handling code, but moving the word dolor outside the trigger...

Lorem ipsum [foo bar;baz dolor
[[quux sit amet, consectetuer
adipiscing]]
elit.] Sed massa mi, dapibus.

changes all that, because now the trigger is seen by the Text handler, not the Tag handler.

Now we see that, at its root, this problem is the same as the problem which led to the creation of the VerbatimText handler: a handler must be able to introspect the layers "above" it to make the correct decision in one specific case. And now, like then, I feel the proper thing to do is to stick with the design as it stands.

This means that the proper place for code to detect this grammar violation and throw the associated status code is the Grammarian itself. The Grammarian must be responsible for all such issues where knowledge of the node context stack is needed to make the right decision (just as it is in the case of VerbatimText, which the Grammarian will explicitly call instead of the regular Text handler when a Verbatim context is atop the stack).
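As a rough sketch of what that responsibility might look like: the dispatch function below, the way tokens are classified, and the representation of the context stack as an array of names are all assumptions made for illustration. Only the handler names and status 3400 come from the project.

```perl
use strict;
use warnings;

# Hypothetical dispatcher. $stack holds the names of the open node
# contexts, innermost last. Returns (handler_name, status).
sub dispatch {
    my ($stack, $token) = @_;
    my $top    = @$stack ? $stack->[-1] : '';
    my $in_tag = grep { $_ eq 'Tag' } @$stack;

    if ($token =~ /^\[\[/) {
        # Status 3400: trigger or verbatim found in tag content.
        # Only the stack can tell us this; no single handler can.
        return (undef, 3400) if $in_tag;
        return ('Trigger', 0);
    }
    return ('Tag', 0) if $token =~ /^\[/;

    # Plain text: pick VerbatimText when a Verbatim context is atop the stack.
    return ($top eq 'Verbatim' ? 'VerbatimText' : 'Text', 0);
}

for my $case (
    [ [],           '[[quux'  ],
    [ ['Tag'],      '[[quux'  ],
    [ ['Verbatim'], 'dolor'   ],
    [ [],           'dolor'   ],
) {
    my ($handler, $status) = dispatch(@$case);
    printf "%-10s %-8s -> %s (status %d)\n",
        join('/', @{ $case->[0] }) || '(empty)', $case->[1],
        $handler // 'none', $status;
}
```

Because the check looks at the whole stack rather than at the current handler, it does not matter whether the trigger's opening markup arrives straight after the tag's attributes or after some intervening text.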

Saturday, September 13, 2008

Grammarian stub-out

I just typed this:

use Firepear::Carrot::DOM;
use Firepear::Carrot::Lexer;
use Firepear::Carrot::Pragma::Document;
use Firepear::Carrot::Pragma::Include;
use Firepear::Carrot::Handler::Comment;
use Firepear::Carrot::Handler::Tag;
use Firepear::Carrot::Handler::Text;
use Firepear::Carrot::Handler::Trigger;
use Firepear::Carrot::Handler::Verbatim;
use Firepear::Carrot::Handler::VerbatimText;

It made me a little tingly. It's finally coming together.

Thursday, September 11, 2008

857 tests, 911 LOC

On the eve of beginning work on the Grammarian, this is where I stand.

Tonight tests were dropped in for the 600, 701, and 702 handlers in Document; VerbatimText got its own test suite (largely a copy of Text's, which it works almost-but-not-quite-exactly like, making for very good regression testing, I think); coverage went up on Handler and VerbatimText -- it's almost all greens now, and only one orange block in the "real" modules.

Tuesday, September 9, 2008

Document pragma validation

Finally getting back to work. As always, it's simpler than I thought it would be. Testing tomorrow. Then, the Grammarian.