I’m a newcomer to scrAPI and I must say that it really saved me some time. Thanks!
However, I still have some doubts so I’ve decided to ask. For example, how would you parse something like this without using regular expressions?: adasfasfda adasfasfda
adfasdfa
adfasdfa
The aim is to get only the first two paragraphs, but not the last two and get as a result the contents of tag in one variable and the rest of tag in another one.
Sorry, the original markup was interpreted by the blog engine. Let’s try again:
a adfasfafas a adfasfafas
adfasfafas
adfasfafas
Again, the aim is to get only the first two paragraphs, but not the last two and get as a result the contents of tag in one variable and the rest of tag in another one.
You can get just the <b> elements by selecting for “b”. And once you get them, you can remove them from the element (leaving just the rest of the text) by calling detach on each element.
So:
bolds = HTML::Selector.new("b").select(element)
bolds.each { |b| b.detach }
When you’re dealing with the text itself (e.g. you want the first two lines but not the last two), use regular expressions.
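For the text-handling step, here is a minimal sketch in plain Ruby. It assumes the leftover text has one paragraph per line; the sample string and the variable names are made up for illustration and are not part of scrAPI.

```ruby
# Hypothetical text left over after the <b> elements were detached,
# one paragraph per line (sample data, not from a real page).
text = "first paragraph\nsecond paragraph\nthird paragraph\nfourth paragraph\n"

# Split on line breaks with a regular expression (tolerating \r\n),
# trim each line, and drop any blank ones.
lines = text.split(/\r?\n/).map(&:strip).reject(&:empty?)

wanted = lines.first(2) # the first two paragraphs
rest   = lines.drop(2)  # everything after them

puts wanted.inspect
puts rest.inspect
```

If the paragraphs are separated by something other than line breaks, only the pattern passed to split needs to change.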
Hello Assaf,
Thanks a lot for a great library.
But I have a few questions about it:
1. If you use Tidy, why not build XML and extract the content with XPath?
2. Is it possible to write logical selectors like
E[foo="bar" && foo2="bar2"]
I don’t know CSS selectors very well, but is it possible to write a CSS selection equivalent to this XPath?
//E[contains(/F/@class,'some')]
That is, an E with an F child that has the substring ‘some’ in its class attribute.
I am thinking of adding this feature to scrAPI. The CSS syntax could look like:
E:contains(F.some)
The reason it’s not there yet is very simple. Extending the syntax so that it’s simple to understand and does 80% of the work is easy. A simple :contains will work.
Extending the syntax to do the other 20% is very hard, and I think the best way is to just use Ruby blocks. But I need a few more examples of real code so I can extract a clean and simple pattern out of them.
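To make the block idea concrete, here is a rough sketch of how the E-with-an-F-child condition could be written as a Ruby block. Node and select_nodes are hypothetical stand-ins invented for this example; they are not scrAPI classes or methods, and the real node API differs.

```ruby
# Hypothetical stand-in for a parsed element; scrAPI's real node classes differ.
Node = Struct.new(:name, :attributes, :children) do
  # Yield this node and then every descendant, depth first.
  def each_node(&block)
    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Hypothetical helper: collect every node under root for which the block is true.
def select_nodes(root)
  found = []
  root.each_node { |node| found << node if yield(node) }
  found
end

# Made-up sample tree: three E elements, only the first has a matching F child.
tree = Node.new("div", {}, [
  Node.new("E", { "id" => "one" }, [Node.new("F", { "class" => "awesome item" }, [])]),
  Node.new("E", { "id" => "two" }, [Node.new("F", { "class" => "plain" }, [])]),
  Node.new("E", { "id" => "three" }, [])
])

# An E with an F child whose class attribute contains the substring "some".
matches = select_nodes(tree) do |node|
  node.name == "E" && node.children.any? { |child|
    child.name == "F" && child.attributes["class"].to_s.include?("some")
  }
end

ids = matches.map { |node| node.attributes["id"] }
puts ids.inspect
```

The block can hold any Ruby condition, which is what makes it a fit for the cases a selector syntax would struggle to express.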
You have done an incredible job with scrAPI. I am learning Ruby mainly to take advantage of your library. So far I have only scraped one complex page, and the process was smooth enough. I still have to see whether my code will be maintainable going forward.
One thing I tried is the enhancements suggested at http://www.quarkruby.com/2008/1/30/scrapi-enhancements/. I liked the idea but ran into a couple of issues with those changes. Are you planning to implement such a feature? Their blog has also helped me in using scrAPI.
I was also wondering whether you can point me to performance tips for using scrAPI and Tidy, so that I can make sure my code is performance-conscious. I hear horror stories about Ruby performance in general.
Also, I noticed there is not much activity on scrAPI. Is there a plan to improve and maintain it going forward?
Thank you for all your efforts in building scrAPI.