Based on your feedback, I changed the behavior so that processing rules no longer “consume” the element they process. Instead, if you don’t want any other rule to process that element (and its children), either call the skip method or pass the argument :skip=>true. The old behavior was premature optimization (bad); the new one is more explicit and easier to control.
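To make the difference concrete, here is a rough sketch of the idea in plain Ruby. It is a toy model, not scrAPI's actual code: ToyScraper, Element and the lambda matchers are all made up for illustration, and only the skip method and the skip flag mirror what is described above.

```ruby
# Toy model of non-consuming rules. NOT scrAPI's implementation:
# ToyScraper, Element and the lambda matchers are hypothetical.
Element = Struct.new(:name, :children)

class ToyScraper
  Rule = Struct.new(:matcher, :skip, :handler)

  def initialize
    @rules = []
  end

  # Register a rule. skip: true plays the role of the :skip=>true argument.
  def process(matcher, skip: false, &handler)
    @rules << Rule.new(matcher, skip, handler)
  end

  # Call from inside a handler to stop other rules from seeing the element.
  def skip
    @skipped = true
  end

  def scrape(element)
    @skipped = false
    @rules.each do |rule|
      next unless rule.matcher.call(element)
      instance_exec(element, &rule.handler)
      @skipped ||= rule.skip
      break if @skipped
    end
    # A skipped element's children are not visited either.
    return if @skipped
    element.children.each { |child| scrape(child) }
  end
end

seen = []
scraper = ToyScraper.new
is_div = ->(e) { e.name == "div" }

# The first rule does not consume the div, so the second rule still sees it.
scraper.process(is_div) { |e| seen << "first:#{e.name}" }
# The second rule asks explicitly for the element (and its children) to be skipped.
scraper.process(is_div, skip: true) { |e| seen << "second:#{e.name}" }
# The catch-all rule never sees the div or the p nested inside it.
scraper.process(->(_e) { true }) { |e| seen << "any:#{e.name}" }

tree = Element.new("body", [Element.new("div", [Element.new("p", [])])])
scraper.scrape(tree)
```

In this sketch both div rules fire because the first one no longer swallows the element, and nothing after the skipping rule touches the div or its child.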
Out of that, I extracted a Microformats helper for Rails. It seemed only reasonable to use one piece of code to produce the output and another to test it, so I wrote a simple hAtom scraper using scrAPI. It’s an early release that handles hAtom and very basic hCard, but it’s worth checking out. It’s also an example of how to write scrapers: I incorporated a few tips and tricks in there.
You can find it in
The last notable change is the addition of a collect() method that gets called before result(). It turned out to be essential when working with hAtom: for example, if the update date/time is missing, it defaults to the published date/time. That all happens during