Tagsoup is Super!
I've never been much of a fan of screen-scraping but I seem to be doing a fair amount in my spare time recently.
For example, I put together an application to run a private NASCAR "fantasy" pool with some friends and family. One of the features I have is a live update mechanism to see where everyone stands during the race.
The only problem: I couldn't find a simple data feed with the current race information. OK, so the next best thing is to look at one of the sports sites, like the Yahoo NASCAR update, and try to extract the information from the HTML.
Well, the data looks relatively well formatted, but as we all know, browsers are pretty lenient about what they accept as HTML, what with missing close tags and so on. I wanted to use an XML parser, and even XSLT, to extract the data, so I needed a way to fix up the HTML before passing it to a parser.
Enter Tagsoup: a SAX-compliant HTML parser that spits out well-formed XML. Ah, that sounds like just the ticket. Even better, the maintainers include a modified version of Saxon, called TSaxon, to process XSLT.
So with TSaxon in hand, it was easy work to convert something like an ESPN Qualifying Grid into SQL that I can load into Derby, using an XSL stylesheet like this:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:html="http://www.w3.org/1999/xhtml"
version="1.0">
<xsl:output method="text"/>
<xsl:param name="race"/>
<xsl:param name="season"/>
<xsl:param name="table"/>
<xsl:param name="type"/>
<xsl:template match="/">
<xsl:apply-templates select="//html:table[html:tr[html:td[@colspan='5']]]"/>
</xsl:template>
<xsl:template match="html:table">
<xsl:text>delete from </xsl:text>
<xsl:value-of select="$table"/>
<xsl:text> where season = </xsl:text>
<xsl:value-of select="$season"/>
<xsl:text> and race = </xsl:text>
<xsl:value-of select="$race"/>
<xsl:text>; </xsl:text>
<xsl:apply-templates select="html:tr[@class = 'evenrow' or @class='oddrow']"/>
<xsl:text>commit; </xsl:text>
</xsl:template>
<xsl:template match="html:tr">
<xsl:variable name="pos" select="html:td[1]"/>
<xsl:variable name="car" select="normalize-space(html:td[3])"/>
<xsl:variable name="spd" select="html:td[5]"/>
<xsl:text>insert into </xsl:text>
<xsl:value-of select="$table"/>
<xsl:text> values (</xsl:text>
<xsl:value-of select="$season"/>
<xsl:text>, </xsl:text>
<xsl:value-of select="$race"/>
<xsl:text>, </xsl:text>
<xsl:value-of select="$pos"/>
<xsl:text>, '</xsl:text>
<xsl:value-of select="$car"/>
<xsl:text>', '</xsl:text>
<xsl:value-of select="$spd"/>
<xsl:text>', </xsl:text>
<xsl:value-of select="$type"/>
<xsl:text>); </xsl:text>
</xsl:template>
</xsl:stylesheet>
I run it by passing the page to TSaxon like this:
java -jar lib/saxon.jar -H "$1" xsl/grid-to-sql.xsl race=$2 season=$3 table=grid type=$4
Ok, so that's pretty cool by itself. But now with the NFL season just beginning, and another private "fantasy" pool among relatives, I found that I wanted to do a similar "live tracker" to see everyone's points in the pool. This time I wanted to do it a little bit differently.
I wanted to programmatically call Tagsoup to parse a page and pass the result through an XSLT stylesheet. So why not use the JAXP facilities in Java? It turns out to be pretty easy. This easy, with JDOM holding the document:
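// JDOM 1.x package names (JDOM 2 uses org.jdom2 instead)
import java.net.URL;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;
import org.jdom.transform.JDOMResult;
import org.jdom.transform.JDOMSource;

// Create a JDOM builder that uses Tagsoup as its SAX parser
SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser");
// Parse my (HTML) URL into a well-formed document
Document doc = builder.build(new URL(poolurl));
JDOMResult result = new JDOMResult();
JDOMSource source = new JDOMSource(doc);
// Get a JAXP Factory
TransformerFactory factory = TransformerFactory.newInstance();
// Get a Transformer with the XSL loaded.
StreamSource sheet = new StreamSource(sheetpath);
Transformer transformer = factory.newTransformer(sheet);
// Transform the page
transformer.transform(source, result);
// Spit out the result.
XMLOutputter outputter = new XMLOutputter(Format.getPrettyFormat());
outputter.output(result.getDocument(), out);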
The middle section is really the part where JAXP comes into play. But as you can see, it's quite simple.
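The command-line run earlier handed race, season, table and type to the stylesheet as parameters. From JAXP, the matching call is Transformer.setParameter. Here is a minimal sketch of that, using only the JDK's built-in XSLT processor rather than TSaxon. The stylesheet is a trimmed-down, hypothetical cousin of the grid one, and the sample page, driver names, and the class name ParamSketch are all made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ParamSketch {
    // Hypothetical stylesheet: emits one insert statement per table row.
    static final String SHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:html='http://www.w3.org/1999/xhtml' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='season'/>"
      + "<xsl:param name='race'/>"
      + "<xsl:template match='/'>"
      + "<xsl:apply-templates select='//html:tr'/>"
      + "</xsl:template>"
      + "<xsl:template match='html:tr'>"
      + "<xsl:text>insert into grid values (</xsl:text>"
      + "<xsl:value-of select='$season'/><xsl:text>, </xsl:text>"
      + "<xsl:value-of select='$race'/><xsl:text>, </xsl:text>"
      + "<xsl:value-of select='html:td[1]'/><xsl:text>, '</xsl:text>"
      + "<xsl:value-of select='normalize-space(html:td[2])'/>"
      + "<xsl:text>'); </xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Made-up, already well-formed XHTML standing in for Tagsoup's output.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body><table>"
      + "<tr><td>1</td><td> Driver A </td></tr>"
      + "<tr><td>2</td><td>Driver B</td></tr>"
      + "</table></body></html>";

    static String toSql(String xhtml, String season, String race) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(SHEET)));
            // Same role as "season=... race=..." on the command line
            transformer.setParameter("season", season);
            transformer.setParameter("race", race);
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toSql(SAMPLE, "2007", "26"));
    }
}
```

In the real thing, the source would be the JDOMSource built from the Tagsoup-parsed page instead of a string.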
Oh, and say you just want to programmatically select nodes with XPath. That's pretty easy too. Here's an example, using Jaxen's JDOMXPath, that gets the title of the page:
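import org.jaxen.jdom.JDOMXPath;
import org.jdom.Element;

JDOMXPath titlePath = new JDOMXPath("/h:html/h:head/h:title");
// Tagsoup puts elements in the XHTML namespace, so bind a prefix to it
titlePath.addNamespace("h", "http://www.w3.org/1999/xhtml");
String title = ((Element) titlePath.selectSingleNode(doc)).getText();
out.println("Title is " + title);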
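If you'd rather not pull in Jaxen, the JDK's own javax.xml.xpath can do roughly the same thing on a DOM document. This is a minimal sketch, assuming the page has already been turned into well-formed XHTML; the sample string and the class name TitleXPathSketch are made up. It also shows why the namespace binding matters: the same path without the prefix matches nothing.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class TitleXPathSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Made-up page standing in for Tagsoup's well-formed output.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'>"
      + "<head><title>Pool Standings</title></head><body/></html>";

    // Evaluates an XPath expression as a string, with "h" bound to XHTML.
    static String eval(String expr, String xhtml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // DOM parsing ignores namespaces by default
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "h" : null;
                }
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("h").iterator()
                        : Collections.<String>emptyIterator();
                }
            });
            return xpath.evaluate(expr, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Title is " + eval("/h:html/h:head/h:title", SAMPLE));
        // Unprefixed names only match elements in no namespace: empty result
        System.out.println("Unprefixed: [" + eval("/html/head/title", SAMPLE) + "]");
    }
}
```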
Voila.

In the absence of Rama. Insert obligatory NASCAR crack here.
Posted by Matthew Montgomery on September 13, 2007 at 01:28 PM PDT #
I'm glad you like TagSoup. Version 1.2 is now out with many many bug fixes; you might want to grab it.
Posted by John Cowan on January 25, 2008 at 03:31 PM PST #