1. Sep 5th, 2005

    Ruby Microcontent Parser

    That took a while longer than I expected, but it’s finally here.

    Basically, it’s a framework for writing microcontent parsers. A microcontent parser is a class with a set of rules for extracting interesting content from (X)HTML documents. You create your own parser by writing a class that includes _MicrocontentParser_ and declares those rules.

    The magic happens in the _parse_ method, which takes an (X)HTML document or element, runs all the rules on it, and returns a new object that holds the extracted values.

    Here’s an example:

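    class MyParser
      include MicrocontentParser

      # Collect the href of every link in the document.
      rule :links, "a", "a@href"
      # Collect the text of every link marked with rel="tag".
      rule :tags, "a[rel~=tag]", "text()"
    end

    content = MyParser.parse(doc)
    # Interpolate the count: "Found " + content.links.size raises a TypeError.
    puts "Found #{content.links.size} links" if content.links
    puts "Tagged with #{content.tags.join(', ')}" if content.tags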

    The class _MyParser_ is a microcontent parser with two rules. The first rule extracts the URLs of all the links in the document, and adds them to the _links_ array. The second rule extracts all the tag names in the document, and adds them to the _tags_ array.

    The call to _parse_ returns an object of type _MyParser_ with all links and tags extracted from the document.
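
    To give a feel for how a rule-based parser like this can hang together, here’s a minimal sketch. It is not the actual MicrocontentParser code: the _SketchParser_ module, the _LinkParser_ class and the block-based form of _rule_ are all made up for illustration, and plain regular expressions stand in for the real selector support.

    ```ruby
    # Hypothetical sketch, NOT the real MicrocontentParser implementation.
    module SketchParser
      # When the module is included, add the class-level methods to the class.
      def self.included(base)
        base.extend(ClassMethods)
      end

      module ClassMethods
        # Rules declared so far, stored as name => extractor block.
        def rules
          @rules ||= {}
        end

        # Declare a rule: a name plus a block that pulls values out of the document.
        def rule(name, &extractor)
          rules[name] = extractor
          attr_reader name
        end

        # Run every rule on the document and return a new instance holding the results.
        def parse(doc)
          result = new
          rules.each do |name, extractor|
            values = extractor.call(doc)
            # Leave the attribute nil when nothing matched, so `if content.links` works.
            result.instance_variable_set("@#{name}", values) unless values.empty?
          end
          result
        end
      end
    end

    class LinkParser
      include SketchParser

      # Assumption: a regular expression stands in for the "a@href" selector.
      rule(:links) { |doc| doc.scan(/<a\s[^>]*href="([^"]*)"/).flatten }
      # Simplified: only matches links whose rel attribute is exactly "tag".
      rule(:tags)  { |doc| doc.scan(/<a\s[^>]*rel="tag"[^>]*>([^<]*)<\/a>/).flatten }
    end

    html = '<p><a href="/tags/ruby" rel="tag">ruby</a> <a href="http://example.com/">home</a></p>'
    content = LinkParser.parse(html)
    puts "Found #{content.links.size} links" if content.links
    puts "Tagged with #{content.tags.join(', ')}" if content.tags
    ```

    The point of the sketch is the shape: _rule_ is a class-level method that records what to extract, and _parse_ builds a fresh instance and fills in one attribute per rule. A rule that matches nothing leaves its attribute as nil, which is why the example guards each _puts_ with an _if_.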

    The documentation (there are many more features to learn about) and the source code are both here: http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/MicroParserRuby
