P@ Sunglasses


20040909 Thursday, September 09, 2004

Danny Ayers' JSoup could be a way to make Rome a more "liberal" feed parser

The excellent RDF buff Danny Ayers wrote JSoup, a tag soup parser in Java that turns HTML/Bozo XML into well-formed XML.

I gave it a try and it seems to work. I ran the demo and hit the same problem he describes with nested elements (pat.rss is my feed, with the closing title tag missing at channel level), which is expected since the demo is set up in streaming mode.

java -cp .. jsoup.JSoupDemo pat.rss patfixed.rss
diff pat.rss patfixed.rss
411,412c411,412
<   </channel>
< </rss>
---
>   </title></channel>
> </rss>

But when I switched to non-serial mode, it worked fine: the missing closing tag is inserted right after the title instead of at the end of the channel.

java -cp .. jsoup.JSoupDemoString pat.rss patfixed2.rss
diff pat.rss patfixed2.rss
6c6
<   <link>http://blogs.sun.com/roller/page/pat</link>
---
>   </title><link>http://blogs.sun.com/roller/page/pat</link>

Today the Rome parser accepts only feeds that are well-formed XML.
One way to make it more liberal would be to trigger JSoup whenever an XML parsing error happens, and then pass the cleaned-up String to Rome again for a second attempt.

JSoup doesn't know how to do that in serial mode, so performance will not be that good, but that is not a problem because it would be an exceptional code path.

Also, this liberal behavior will be an option that you have to turn on explicitly. The default behavior will stay non-liberal.
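Here is a minimal sketch of what that fallback could look like. It is only an illustration under stated assumptions, not Rome or JSoup code: the class name `LiberalFeedParser`, the `liberal` flag, and the `cleaner` function are all hypothetical names, the JDK DOM parser stands in for Rome's strict parsing step, and `cleaner` stands in for whatever String-to-String cleanup JSoup would provide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: strict parse first, cleanup only on failure.
final class LiberalFeedParser {
    // Hypothetical option; callers must pass true to get liberal behavior.
    private final boolean liberal;
    // Stand-in for a JSoup-style cleanup working on the whole String
    // (the non-serial mode), not the real JSoup API.
    private final UnaryOperator<String> cleaner;

    LiberalFeedParser(boolean liberal, UnaryOperator<String> cleaner) {
        this.liberal = liberal;
        this.cleaner = cleaner;
    }

    Document parse(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        try {
            // Normal path: well-formed feeds never touch the cleaner.
            return strictParse(xml);
        } catch (SAXException e) {
            if (!liberal) {
                throw e; // default behavior: reject malformed feeds
            }
            // Exceptional path: clean up once, then retry strictly.
            // If the cleaned text is still malformed, the error propagates.
            return strictParse(cleaner.apply(xml));
        }
    }

    private static Document strictParse(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors and keeps the parser
        // from printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

With this shape, `new LiberalFeedParser(false, cleaner).parse(feed)` behaves like a strict parser, and the slower cleanup cost is paid only by feeds that fail the first attempt and only when the caller asked for it.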

What do you folks interested in Rome think about this proposal?
Danny, would you be willing to contribute JSoup as a Rome subproject?
Because it would be an optional behavior for Rome, and because JSoup can be used for things other than fixing feeds, I think it should be a subproject.

( Sep 09 2004, 04:28:22 AM PDT )



This is a personal weblog, I do not speak for my employer.