That took a while longer than I expected, but it’s finally here.
Basically, it’s a framework for writing microcontent parsers. A microcontent parser is a class with a set of rules for extracting interesting content from (X)HTML documents. You create your own parser by writing a class and declaring its rules.
The magic happens in the _parse_ method, which takes an (X)HTML document or element, runs all the rules on it, and returns a new object that holds the extracted values.
Here’s an example:
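class MyParser
  include MicrocontentParser
  # Collect the href attribute of every link.
  rule :links, "a", "a@href"
  # Collect the text of every link marked rel="tag".
  rule :tags, "a[rel~=tag]", "text()"
end

# doc is an (X)HTML document or element obtained elsewhere.
content = MyParser.parse(doc)
# Interpolation converts the integer size to a string.
puts "Found #{content.links.size} links" if content.links
puts "Tagged with #{content.tags.join(', ')}" if content.tags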
The class _MyParser_ is a microcontent parser with two rules. The first rule extracts the URLs of all the links in the document, and adds them to the _links_ array. The second rule extracts all the tag names in the document, and adds them to the _tags_ array.
The call to _parse_ returns an object of type _MyParser_ with all links and tags extracted from the document.
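If you’re curious how a class-level _rule_ declaration and a _parse_ method could fit together, here is a rough, self-contained sketch. It is not the framework’s actual code: _SketchParser_ and _SketchLinks_ are made-up names, the rules use REXML XPath expressions and blocks instead of the selector syntax shown above, and leaving an attribute nil when nothing matches is an assumption on my part.

```ruby
require "rexml/document"

# Hypothetical stand-in for the framework; not the real MicrocontentParser.
module SketchParser
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # Rules declared on this class, in declaration order.
    def rules
      @rules ||= []
    end

    # Register a rule: each element matching `xpath` is passed to the block,
    # and the non-nil results are collected under the reader called `name`.
    def rule(name, xpath, &extract)
      attr_reader name
      rules << [name, xpath, extract]
    end

    # Run every rule against the document and return a populated instance.
    def parse(doc)
      result = new
      rules.each do |name, xpath, extract|
        values = REXML::XPath.match(doc, xpath).map { |el| extract.call(el) }.compact
        # Assumption: the attribute stays nil when nothing matched.
        result.instance_variable_set("@#{name}", values) unless values.empty?
      end
      result
    end
  end
end

class SketchLinks
  include SketchParser
  rule(:links, "//a") { |a| a.attributes["href"] }
  rule(:tags, "//a[@rel='tag']") { |a| a.text }
end

doc = REXML::Document.new(<<~XHTML)
  <div>
    <a href="http://example.com/">Home</a>
    <a href="http://example.com/tags/ruby" rel="tag">ruby</a>
  </div>
XHTML

content = SketchLinks.parse(doc)
puts "Found #{content.links.size} links" if content.links
puts "Tagged with #{content.tags.join(', ')}" if content.tags
```

The point of the sketch is that the rules live on the class while the extracted values live on the instance that _parse_ returns, so one parser class can be reused across many documents.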
The documentation (there are a lot more features to learn about) and the source code are all here: http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/MicroParserRuby