Thursday, September 09, 2004
The excellent RDF buff Danny Ayers wrote JSoup, a tag soup parser in Java that turns HTML/Bozo XML into well-formed XML.
I gave it a try and it seems to work. Running the demo, I hit the same problem he described with nested elements (pat.rss is my feed, with the closing title tag missing at channel level), which is expected since the demo is set up in streaming mode.
java -cp .. jsoup.JSoupDemo pat.rss patfixed.rss
diff pat.rss patfixed.rss
411,412c411,412
< </channel>
< </rss>
---
> </title></channel>
> </rss>
But when I switched to non-serial mode (the String-based demo), it worked fine.
java -cp .. jsoup.JSoupDemoString pat.rss patfixed2.rss
diff pat.rss patfixed2.rss
6c6
<
<link>http://blogs.sun.com/roller/page/pat</link>
---
>
</title><link>http://blogs.sun.com/roller/page/pat</link>
Today the Rome parser accepts only feeds that are well-formed XML.
One way to make it more liberal would be to trigger JSoup whenever an XML parsing error happens, and then pass the cleaned-up String to Rome again for a new attempt.
JSoup doesn't know how to do that in serial mode, so performance will not be that good, but that's not a problem because it would be an exceptional code path.
Also, this liberal behavior would be an option that you have to enable. The default behavior would stay non-liberal.
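To make the proposal concrete, here is a minimal sketch of the retry logic in Java. It is not Rome or JSoup code: the class name LiberalFeedParser, the liberal flag, and the cleaner function are all hypothetical, the JDK's own DOM parser stands in for Rome's strict parsing step, and the cleaner stands in for whatever JSoup's String mode would expose.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical wrapper: strict parse first, clean-and-retry only on failure.
class LiberalFeedParser {
    private final boolean liberal;               // false means strict, the proposed default
    private final UnaryOperator<String> cleaner; // assumption: stand-in for a JSoup-like String cleaner

    LiberalFeedParser(boolean liberal, UnaryOperator<String> cleaner) {
        this.liberal = liberal;
        this.cleaner = cleaner;
    }

    Document parse(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        try {
            // Normal path: well-formed feeds never touch the cleaner.
            return strictParse(xml);
        } catch (SAXException e) {
            if (!liberal) {
                throw e; // strict mode: report the error as today
            }
            // Exceptional path: clean the whole String once, then try again.
            return strictParse(cleaner.apply(xml));
        }
    }

    private static Document strictParse(String xml)
            throws SAXException, IOException, ParserConfigurationException {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

The cleaner only runs after a parsing failure, so well-formed feeds pay no extra cost, and if the cleaned String still does not parse, the second error propagates to the caller.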
What do you folks interested in Rome think about this proposal?
Danny, would you be willing to contribute JSoup as a rome subproject?
Because it would be an optional behavior for Rome, and because JSoup can be used for things other than fixing feeds, I think it should be a subproject.
Posted by Lance Lavandowska on September 09, 2004 at 07:40 AM PDT #
Making it pluggable/switchable and as a subproject probably would be the best approach, and if the XML is ill-formed then the user should somehow be notified - a "bozo bit" along the same lines as Mark Pilgrim's parser is probably easiest (his strategies are definitely worth looking at).
Rome uses JDOM, right? I can't remember offhand where the parsing happens. Must have a look at the source.
Anyhow I've tried to set it up so there's quite a bit of control over the character set/encoding, although chances are many errors related to RFC 3023 will be silently ignored by the 'correct' parser.
Swapping out dodgy characters on the fly /should/ be pretty easy; the code as it stands (more through luck than judgement) does a reasonable job with things like stray '<' characters.
Apart from using the thing for cleaning full feeds, I also had in mind using it for tidying HTML content inside items/entries - there's some clunky unescaping code commented out somewhere. Also, I was hoping it might be useful as a pre-processor for XSLT - I guess that would tie in with it being a subproject.
Sorry about the lack of code comments btw ;-)
(PS. I love the spam block!)
Posted by Danny on September 09, 2004 at 09:28 AM PDT #
Posted by Danny on September 09, 2004 at 09:34 AM PDT #
Posted by Patrick Chanezon on September 10, 2004 at 06:26 AM PDT #
Posted by Patrick Chanezon on September 10, 2004 at 06:28 AM PDT #
Posted by Tim Bray on September 15, 2004 at 10:51 AM PDT #