P@ Sunglasses


20040909 Thursday September 09, 2004

Danny Ayers's JSoup could be a way to make Rome a more "liberal" feed parser

The excellent RDF buff Danny Ayers wrote JSoup, a tag soup parser in Java that turns HTML/Bozo XML into well-formed XML.

I gave it a try and it seems to work. I ran the demo on pat.rss (my feed, which is missing the closing title tag at channel level) and hit the same problem he describes with nested elements, which is expected since the demo is set up in streaming mode: the missing closing tag is only inserted at the very end, just before the closing channel tag.

java -cp .. jsoup.JSoupDemo pat.rss patfixed.rss
diff pat.rss patfixed.rss
411,412c411,412
<   </channel>
< </rss>
---
>   </title></channel>
> </rss>

But when I switched to non-serial mode it worked fine: the closing tag is inserted right where it belongs, before the link element.

java -cp .. jsoup.JSoupDemoString pat.rss patfixed2.rss
diff pat.rss patfixed2.rss
6c6
<   <link>http://blogs.sun.com/roller/page/pat</link>
---
>   </title><link>http://blogs.sun.com/roller/page/pat</link>

Today the Rome parser accepts only feeds that are well-formed XML.
One way to make it more liberal would be to trigger JSoup whenever an XML parsing error happens, and then pass the cleaned-up String to Rome again for a new attempt.

JSoup doesn't know how to do that in serial mode, so performance will not be that good, but that's not a problem because it would be an exceptional code path.

Also, this liberal behavior would be an option that you need to enable explicitly. The default behavior would stay non-liberal.
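
To make the idea concrete, here is a minimal sketch of the fallback, not actual Rome or JSoup code. Everything named here is an assumption for illustration: the class LiberalParseSketch, its Result type, and the "bozo" flag are hypothetical; the JDK's DocumentBuilder stands in for Rome's strict parser; and the cleaner is just a String-to-String function standing in for whatever JSoup's non-serial entry point turns out to be.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: strict parse first, optional clean-up and retry on failure. */
final class LiberalParseSketch {

    /** Parsed document plus a flag telling the client that clean-up was needed. */
    static final class Result {
        public final Document document;
        public final boolean bozo;

        Result(Document document, boolean bozo) {
            this.document = document;
            this.bozo = bozo;
        }
    }

    private final boolean liberal;                // off by default in the proposal
    private final UnaryOperator<String> cleaner;  // stand-in for a tag soup cleaner

    LiberalParseSketch(boolean liberal, UnaryOperator<String> cleaner) {
        this.liberal = liberal;
        this.cleaner = cleaner;
    }

    /** Tries a strict parse; in liberal mode, cleans the input and retries once. */
    Result parse(String xml) throws Exception {
        try {
            return new Result(strictParse(xml), false);
        } catch (SAXException e) {
            if (!liberal) {
                throw e; // non-liberal mode: report the broken feed as-is
            }
            // Exceptional path: if the cleaned text is still ill-formed,
            // the second SAXException propagates to the caller.
            return new Result(strictParse(cleaner.apply(xml)), true);
        }
    }

    private static Document strictParse(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Same kind of breakage as pat.rss: channel-level title never closed.
        String broken = "<rss><channel><title>Pat"
                + "<link>http://blogs.sun.com/roller/page/pat</link>"
                + "</channel></rss>";
        // Toy cleaner for the demo only; a real one would be a tag soup parser.
        UnaryOperator<String> toyCleaner = s -> s.replace("<link>", "</title><link>");
        Result result = new LiberalParseSketch(true, toyCleaner).parse(broken);
        System.out.println("bozo=" + result.bozo);
    }
}
```

The clean-up only runs after the strict parse has failed, so well-formed feeds pay nothing extra, and the flag lets a client decide whether to warn the publisher about the broken feed.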

What do you folks interested in Rome think about this proposal?
Danny, would you be willing to contribute JSoup as a Rome subproject?
Because it would be an optional behavior for Rome, and because JSoup can be used for things other than fixing feeds, I think it should be a subproject.

( Sep 09 2004, 04:28:22 AM PDT )

Comments:

I had much the same thoughts when I saw Danny's post. I'd like to see it "integrated", but in an optional manner as you suggest. Dave Johnson has mentioned creating a parser along the lines of Kevin Burton's suggestion (sorry, cannot find link): delete any "bad" characters and reparse. I'm not sure how that fits with Danny's non-valid-XML cleaner, but thought I'd mention it.

Posted by Lance Lavandowska on September 09, 2004 at 07:40 AM PDT

If it might be useful, that'd be great, please use it however you can. Put it under the same license as the rest of Rome (or whatever) - I don't really care (!) - but I would be *very* grateful for any fixes.

Making it pluggable/switchable and as a subproject probably would be the best approach, and if the XML is ill-formed then the user should somehow be notified - a "bozo bit" along the same lines as Mark Pilgrim's parser is probably easiest (his strategies are definitely worth looking at).

Rome uses JDOM, right? I can't remember offhand where the parsing happens. Must have a look at the source.

Anyhow I've tried to set it up so there's quite a bit of control over the character set/encoding, although chances are many errors related to RFC 3023 will be silently ignored by the 'correct' parser.

Swapping out dodgy characters on the fly /should/ be pretty easy; the code as it stands (more through luck than judgement) does a reasonable job with things like stray '<' characters.

Apart from using the thing for cleaning full feeds, I also had in mind using it for tidying HTML content inside items/entries - there's some clunking unescaping code commented out somewhere. Also I was hoping it might be useful as a pre-processor for XSLT - I guess that would tie in with it being a subproject.

Sorry about the lack of code comments btw ;-)

(PS. I love the spam block!)

Posted by Danny on September 09, 2004 at 09:28 AM PDT

PPS. Just remembered, there was another little use - finding <link> tags in blog HTML for autodiscovery of feeds etc. That was the initial motivation for streaming, so once the feed link was found the download could stop.

Posted by Danny on September 09, 2004 at 09:34 AM PDT

Thanks very much, Danny. I'll create a Rome subproject with it, with the right package names and so on, so that it can be used for other purposes too (as a pre-processor for XSLT, or to parse HTML). I'll do that in the next few weeks; no time right now. I agree with the idea of setting a bozo bit accessible to the client, potentially with a message explaining the problems encountered and fixed (maybe that will be a later enhancement). Alejandro takes care of the encoding problems in the Rome fetcher. I did not grok that yet, so I don't know if we'll need to be aware of it in this one (JSoup happens after the fetching). But I've seen that you've joined the mailing list and started discussing this. After that I'll integrate it into Rome as an optional component, with an API to turn it on or off. P@

Posted by Patrick Chanezon on September 10, 2004 at 06:26 AM PDT

Lance, thanks for the info. I read that article recently, though I don't remember where. I haven't looked at Danny's code yet to determine whether he follows this approach, but yes, it is an option to look at.

Posted by Patrick Chanezon on September 10, 2004 at 06:28 AM PDT

Mark Pilgrim's research shows that the percentage of broken feeds is really low. The experience of Syndic8 shows that when you point out to people that their feeds are broken, they are grateful and quickly fix them. (Note that many otherwise-OK feeds are served with the wrong media-type which makes them in theory busted, but this is a different kind of recovery). So... this is a bad idea. If the feed is broken, report it. It'll probably get fixed. If it doesn't, screw 'em. Hiding protocol errors on the Internet is not good policy. -Tim

Posted by Tim Bray on September 15, 2004 at 10:51 AM PDT


This is a personal weblog, I do not speak for my employer.