Chris' SWI blog


http://blogs.sun.com/chrisf/date/20090302 Monday March 02, 2009

Converting HTML to PDF using open source APIs/Tools

Howdy folks, it's been a while since my last post, so I thought I would share some of the code I've been playing around with for converting HTML content/source files into PDF documents. There are various code snippets and examples scattered around the Internet, but here are the steps I took, based on the searches and approaches I found. I wanted to collate it all in one place. It may save someone the hours I spent looking for ideas.

I am using a few different APIs for this:

1/ Apache FOP (see: http://xmlgraphics.apache.org/fop/) version 0.95
2/ The "xhtml2fo.xsl" stylesheet for the XHTML to XSL-FO translation, from here: http://webcoder.info/downloads/xhtml2fo.html
3/ HTMLCleaner version 2.1 from here: http://htmlcleaner.sourceforge.net/

The first step is converting the HTML to XHTML and getting its DOM representation, like so:

final HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();

// true = escape XML special characters in the output
DomSerializer doms = new DomSerializer(props, true);

org.w3c.dom.Document xmlDoc = null;

try {
    // Clean the String directly; going through html.getBytes() would
    // depend on the platform default charset.
    TagNode node = cleaner.clean(html);
    xmlDoc = doms.createDOM(node);
} catch (Exception e) {
    e.printStackTrace();
}


where html is a String containing the HTML source.
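
When a conversion goes wrong, it helps to look at the XHTML that actually came out of this step. Below is a minimal sketch of a debugging helper that dumps any org.w3c.dom.Document to a String using only the JAXP classes in the JDK. The names DomDump, parse and domToString are hypothetical and not part of HtmlCleaner or FOP; parse only exists to build a sample document in place of the HtmlCleaner output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HtmlCleaner or FOP.
class DomDump {

    // Builds a DOM from well-formed XML text (stand-in for the HtmlCleaner output).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Serializes a DOM to text with an identity transform (no stylesheet).
    static String domToString(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Force XML output; a root element named "html" may otherwise
            // be serialized with the HTML output method.
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize DOM", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello</p></body></html>");
        System.out.println(domToString(doc));
    }
}
```

Calling DomDump.domToString(xmlDoc) right after the cleaning step would show whether the markup survived the way you expected before it goes any further down the pipeline.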

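Then, using the DOM, run the XML data through xml2FO. (As written, xml2FO creates its transformer without a stylesheet, so this step is a straight copy of the document; the actual XHTML to FO translation happens when the stylesheet is applied in fo2PDF further down.)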

org.w3c.dom.Document foDoc = null;

try {
    foDoc = xml2FO(xmlDoc);
} catch (Exception e) {
    System.out.println("ERROR: " + e.getMessage());
    e.printStackTrace();
}


The xml2FO method looks like this:

/**
 *  Copies the input document into a new DOM Document.
 *
 *  Note: the Transformer is created without a stylesheet, so this is an
 *  identity transform. The stylesheet itself is applied in fo2PDF.
 *
 *  @param xml  The xml input Document
 *
 *  @return Document  Result of the transform
 *
 *  @throws Exception  if the transformer cannot be created or the transform fails
 */
private static Document xml2FO(Document xml) throws Exception {
    DOMSource xmlDomSource = new DOMSource(xml);
    DOMResult domResult = new DOMResult();

    // newTransformer() never returns null; it throws on failure,
    // so no null check (or System.exit) is needed.
    TransformerFactory factory = TransformerFactory.newInstance();
    Transformer transformer = factory.newTransformer();

    // Let a TransformerException propagate; the caller already catches it.
    transformer.transform(xmlDomSource, domResult);

    return (Document) domResult.getNode();
}
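
One detail worth checking here: in JAXP, TransformerFactory.newTransformer() with no arguments returns an identity transformer that copies its input, while newTransformer(Source) compiles a stylesheet and applies it. The sketch below shows the difference using only JDK classes. The inline stylesheet is a hypothetical toy stand-in for xhtml2fo.xsl (it only handles p elements and is not a usable XHTML to FO mapping), and the names TransformDemo, parse and transform are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class comparing an identity transform with a stylesheet transform.
class TransformDemo {

    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Toy stylesheet: turns every <p> into an fo:block. Stand-in for xhtml2fo.xsl.
    static final String TOY_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:fo='" + FO_NS + "'>"
      + "<xsl:template match='/'>"
      + "<fo:root><xsl:apply-templates select='//p'/></fo:root>"
      + "</xsl:template>"
      + "<xsl:template match='p'>"
      + "<fo:block><xsl:value-of select='.'/></fo:block>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Builds a DOM from well-formed XML text.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Applies the stylesheet text, or makes an identity copy when xsl is null.
    static Document transform(Document input, String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer t = (xsl == null)
                ? factory.newTransformer()
                : factory.newTransformer(new StreamSource(new StringReader(xsl)));
            DOMResult result = new DOMResult();
            t.transform(new DOMSource(input), result);
            return (Document) result.getNode();
        } catch (Exception e) {
            throw new IllegalStateException("Transform failed", e);
        }
    }

    public static void main(String[] args) {
        Document xhtml = parse("<html><body><p>Hello</p></body></html>");
        // Identity copy: the root element is unchanged.
        System.out.println(transform(xhtml, null).getDocumentElement().getNodeName());
        // Stylesheet applied: the root element is now fo:root.
        System.out.println(transform(xhtml, TOY_XSL).getDocumentElement().getNodeName());
    }
}
```

Run as is, main prints html for the identity copy and fo:root once the toy stylesheet is applied, which is a quick way to confirm at which point in the pipeline the document actually becomes XSL-FO.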


Finally, take that output and convert it to PDF:

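byte[] pdfBytes = fo2PDF(foDoc, stylesheet);

if (pdfBytes == null) {
    // fo2PDF returns null when the conversion fails
    System.out.println("Error converting to PDF: " + pdfFileName);
} else {
    // try-with-resources closes the stream even if the write fails
    try (OutputStream pdf = new FileOutputStream(new File(pdfFileName))) {
        pdf.write(pdfBytes);
    } catch (java.io.FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("Error creating PDF: " + pdfFileName);
    } catch (java.io.IOException e) {
        e.printStackTrace();
        System.out.println("Error writing PDF: " + pdfFileName);
    }
}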


where pdfFileName is the filename of the PDF to create and stylesheet is the location of the xhtml2fo.xsl stylesheet.

The fo2PDF method looks like this:

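/**
 *  Applies the stylesheet to the input and feeds the resulting XSL-FO to FOP.
 *
 *  @param foDocument  The input Document (the output of xml2FO)
 *  @param styleSheet  Location of the stylesheet that produces XSL-FO
 *
 *  @return byte[]  PDF result, or null if the conversion failed
 */
private static byte[] fo2PDF(Document foDocument, String styleSheet) {
    FopFactory fopFactory = FopFactory.newInstance();

    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
        Transformer transformer = getTransformer(styleSheet);

        if (transformer == null) {
            // getTransformer has already printed the cause
            return null;
        }

        Source src = new DOMSource(foDocument);
        // The transform output is streamed as SAX events straight into FOP
        Result res = new SAXResult(fop.getDefaultHandler());

        transformer.transform(src, res);

        return out.toByteArray();

    } catch (Exception ex) {
        // Don't swallow the failure silently
        ex.printStackTrace();
        return null;
    }
}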


/**
 *  Create and return a Transformer for the specified stylesheet.
 *
 *  Based on the DOM2DOM.java example in the Xalan distribution.
 *
 *  @param styleSheet  URI of the stylesheet
 *
 *  @return Transformer  or null if the stylesheet could not be loaded
 */
private static Transformer getTransformer(String styleSheet) {
    try {
        TransformerFactory tFactory = TransformerFactory.newInstance();

        // The stylesheet must be parsed namespace-aware, otherwise the
        // xsl: elements are not recognised
        DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
        dFactory.setNamespaceAware(true);

        DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
        Document xslDoc = dBuilder.parse(styleSheet);
        // Pass the location too, so relative xsl:import/xsl:include
        // references can be resolved
        DOMSource xslDomSource = new DOMSource(xslDoc, styleSheet);

        return tFactory.newTransformer(xslDomSource);
    } catch (javax.xml.transform.TransformerException
            | java.io.IOException
            | javax.xml.parsers.ParserConfigurationException
            | org.xml.sax.SAXException e) {
        e.printStackTrace();
        return null;
    }
}
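
Note that getTransformer parses and compiles the stylesheet on every call. If you are converting many documents, JAXP's Templates interface lets you compile the stylesheet once and then create a fresh Transformer per conversion (a Templates object is safe to share between threads, a Transformer is not). What follows is only a sketch of that idea, swapped in for the DOM-based loading above: it reads the stylesheet through a StreamSource, and the names StylesheetCache, templatesFor, newTransformer and writeTempStylesheet, as well as the toy stylesheet, are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

// Hypothetical cache of compiled stylesheets, keyed by stylesheet URI.
class StylesheetCache {

    // Toy stylesheet used only by the demo in main.
    static final String TOY_XSL =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='/'><out/></xsl:template>"
      + "</xsl:stylesheet>";

    private static final Map<String, Templates> CACHE = new ConcurrentHashMap<>();

    // Compiles the stylesheet on first use and returns the cached copy afterwards.
    static Templates templatesFor(String stylesheetUri) {
        return CACHE.computeIfAbsent(stylesheetUri, uri -> {
            try {
                return TransformerFactory.newInstance()
                                         .newTemplates(new StreamSource(uri));
            } catch (TransformerConfigurationException e) {
                throw new IllegalStateException("Cannot compile " + uri, e);
            }
        });
    }

    // Creates a new Transformer from the cached, compiled stylesheet.
    static Transformer newTransformer(String stylesheetUri) {
        try {
            return templatesFor(stylesheetUri).newTransformer();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Cannot create transformer", e);
        }
    }

    // Demo helper: writes stylesheet text to a temp file and returns its URI.
    static String writeTempStylesheet(String xsl) {
        try {
            Path file = Files.createTempFile("stylesheet", ".xsl");
            Files.write(file, xsl.getBytes(StandardCharsets.UTF_8));
            return file.toUri().toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String uri = writeTempStylesheet(TOY_XSL);
        Transformer first = newTransformer(uri);
        Transformer second = newTransformer(uri);
        System.out.println("Same compiled stylesheet: " + (templatesFor(uri) == templatesFor(uri)));
        System.out.println("Separate transformers: " + (first != second));
    }
}
```

In the code above, fo2PDF could then call StylesheetCache.newTransformer(styleSheet) in place of getTransformer(styleSheet); whether that is worth it depends on how often the conversion runs.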


-----

Give it a go and let me know what you think. Is there a better way that you know of? I am currently looking at Apache's PDFBox (see http://incubator.apache.org/pdfbox/) to extract the text and images from the output in a print/layout-friendly manner. Feedback is also welcome.

Cheerio

Chris



This is a personal weblog, I do not speak for my employer.