1. Aug 18th, 2005

    Microcontent Parser

    Slowly and steadily my microformat parser is starting to take shape. I hope to have the source code released in a week or so.

    The first decision I had to make was how to write the parsing rules. Schema-based parsers look like a good idea on paper, but in my experience they rely too heavily on a limited language and are too fragile. At some point you hit a wall and switch back to code, only to realize that code is not that hard and that the “simplicity” of schemas is quite deceiving.

    I decided instead to build the parser around a simple functional model. For each rule, one function to match content, and one function to extract it. Generally, these functions are really simple to write and get the results you want. But there’s no limit to how complex you can make them. And you can always generate them from a schema.
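
    To make the model concrete, here is a minimal sketch in Python. It is not the parser's actual code: the `Rule` class, the `has_class`, `extract_event` and `run` helpers, and the sample markup are all made up for illustration, and it assumes well-formed XHTML so the standard library's ElementTree can read it.

    ```python
    from dataclasses import dataclass
    from typing import Callable
    import xml.etree.ElementTree as ET


    @dataclass
    class Rule:
        """Hypothetical rule: one function to match, one to extract."""
        match: Callable[[ET.Element], bool]    # does the rule apply to this element?
        extract: Callable[[ET.Element], dict]  # pull values out of a matched element


    def has_class(name):
        """Build a match function that checks for a class name."""
        return lambda el: name in el.get("class", "").split()


    def extract_event(el):
        """Extract the summary text from inside an event element."""
        summary = el.find(".//*[@class='summary']")
        return {"summary": summary.text if summary is not None else None}


    def run(rule, root):
        """Apply one rule to every element in the tree, in document order."""
        return [rule.extract(el) for el in root.iter() if rule.match(el)]


    html = """<div>
      <div class="vevent"><span class="summary">Web 2.0 Conference</span></div>
      <div class="other">ignored</div>
    </div>"""

    events = run(Rule(has_class("vevent"), extract_event), ET.fromstring(html))
    print(events)
    ```

    Both halves of a rule are plain functions, so a simple rule stays a one-liner while a complicated one can do whatever it needs to.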

    The downside to this approach is RSI: too many functions to write, all of which do almost the same thing, just with different values. So to make the easy stuff easy, I’ve decided on a declarative way to parameterize them.

    To match content I’m using CSS-like selectors, so ‘.vevent’ will match any element with the class ‘vevent’, ‘li’ will match any list item, etc. If you pass a selector string, it gets converted to a function, eliminating a lot of the coding. To extract content I’m borrowing a bit from programming languages and a bit from XPath, so ‘dtstart=abbr@title|text()’ will get you the event’s start time from either the title attribute of the abbr element, or the text value for any other element.
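
    A rough sketch of how such strings could be turned into functions, again in Python and again hypothetical: `selector` and `extractor` are illustrative names, the selector handles only the ‘tag’, ‘.class’ and ‘tag.class’ forms, and I am assuming that the part before ‘=’ names the extracted value and that the sources after it are tried left to right.

    ```python
    import xml.etree.ElementTree as ET


    def selector(sel):
        """Turn 'tag', '.class' or 'tag.class' into a match function."""
        tag, _, cls = sel.partition(".")

        def match(el):
            if tag and el.tag != tag:
                return False
            if cls and cls not in el.get("class", "").split():
                return False
            return True

        return match


    def extractor(expr):
        """Turn 'name=source|source' into a function returning (name, value).

        Each source is either 'tag@attr' (an attribute, only on that tag)
        or 'text()' (the text content of any element).
        """
        name, _, sources = expr.partition("=")

        def extract(el):
            for source in sources.split("|"):
                if source == "text()":
                    return name, "".join(el.itertext()).strip()
                tag, _, attr = source.partition("@")
                if el.tag == tag and attr in el.attrib:
                    return name, el.get(attr)
            return name, None

        return extract


    doc = ET.fromstring(
        '<div class="vevent">'
        '<abbr class="dtstart" title="2005-10-05">October 5</abbr>'
        '<span class="dtend">October 7</span>'
        '</div>'
    )

    is_event = selector(".vevent")
    get_start = extractor("dtstart=abbr@title|text()")

    start = get_start(doc[0])     # abbr element: value comes from its title attribute
    fallback = get_start(doc[1])  # any other element: value comes from its text
    ```

    The strings are only shorthand. Anywhere a selector or extractor string is accepted, a hand-written function would work just as well.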

    What remains to be done is running multiple rules on the same content in parallel, keeping track of each rule’s state (more code to write, but fewer lines of code to run). Once that’s solved, the code will go [here](http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/MicroParser). Watch this space for the announcement.

    tags: microcontent parser microformats
