Monday March 02, 2009
Converting HTML to PDF using open source APIs/Tools
Howdy folks, it's been a while since my last post, so I thought I would share some of the code I've been playing around with for converting HTML content/source files into PDF documents. There are various snippets and examples scattered around the Internet, but here are the steps I took, based on the searches and approaches I found. I wanted to collate it all in one place; it may save someone the hours I spent looking for ideas.
I am using a few different APIs/tools for this:
1/ Apache FOP (see: http://xmlgraphics.apache.org/fop/) version 0.95
2/ The "xhtml2fo.xsl" stylesheet for the XHTML to XSL-FO translation, from here: http://webcoder.info/downloads/xhtml2fo.html
3/ HTMLCleaner version 2.1 from here: http://htmlcleaner.sourceforge.net/
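The first step is converting the HTML to XHTML and returning its DOM representation, like so:
import java.io.ByteArrayInputStream;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

// Note: getBytes() with no argument uses the platform default charset
ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes());
final HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
// Serializer that turns the cleaned tag tree into a W3C DOM document
DomSerializer doms = new DomSerializer(props, true);
org.w3c.dom.Document xmlDoc = null;
try {
    // Parse the (possibly malformed) HTML into a well-formed tag tree
    TagNode node = cleaner.clean(input);
    xmlDoc = doms.createDOM(node);
} catch (Exception e) {
    e.printStackTrace();
}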
where html is a String holding the HTML source.
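To see why the cleaning step matters, here is a minimal sketch using only the JDK (no HTMLCleaner involved). It simply checks whether a piece of markup parses as XML: typical tag soup does not, while the cleaned-up XHTML form does. The class and method names (WellFormedCheck, isWellFormed) are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
class WellFormedCheck {

    /** Returns true if the markup parses as XML, false if the parser rejects it. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O trouble, not expected for in-memory input
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p>Hello<br>world"));      // prints false
        System.out.println(isWellFormed("<p>Hello<br/>world</p>")); // prints true
    }
}
```

Anything that fails a check like this cannot be fed to an XSLT stylesheet directly, which is the gap the HTMLCleaner step fills.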
Then pass the DOM to xml2FO, which prepares the document for the XSL-FO stage:
org.w3c.dom.Document foDoc = null;
try {
    foDoc = xml2FO(xmlDoc);
} catch (Exception e) {
    System.out.println("ERROR: " + e.getMessage());
    e.printStackTrace();
}
The xml2FO method looks like:
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;

/**
 * Copies the input document with an identity transform. No stylesheet is
 * applied here, because newTransformer() is called without one; the
 * stylesheet is applied later, in fo2PDF.
 *
 * @param xml The xml input Document
 *
 * @return Document Result of the transform
 */
private static Document xml2FO(Document xml) throws Exception {
    DOMSource xmlDomSource = new DOMSource(xml);
    DOMResult domResult = new DOMResult();
    TransformerFactory factory = TransformerFactory.newInstance();
    // newTransformer() never returns null; it throws on failure
    Transformer transformer = factory.newTransformer();
    // Let a TransformerException propagate instead of silently returning null
    transformer.transform(xmlDomSource, domResult);
    return (Document) domResult.getNode();
}
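One thing worth knowing about JAXP here: TransformerFactory.newTransformer() called with no arguments gives you an identity transformer (a plain copy), whereas newTransformer(Source) builds one from a stylesheet. The sketch below shows the difference on a DOM-to-DOM transform, using only the JDK. The one-template stylesheet and the names DomTransformDemo and rootAfterTransform are made up for the example; it is not xhtml2fo.xsl.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class DomTransformDemo {

    // Toy stylesheet (an assumption for this example): turns <p> into <block>
    static final String XSL =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='p'><block><xsl:value-of select='.'/></block></xsl:template>"
        + "</xsl:stylesheet>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        return f.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Transforms xml DOM-to-DOM and returns the root element name; xsl == null means identity. */
    static String rootAfterTransform(String xml, String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer t = (xsl == null)
                    ? factory.newTransformer()   // identity: output is a copy of the input
                    : factory.newTransformer(new StreamSource(new StringReader(xsl)));
            DOMResult result = new DOMResult();
            t.transform(new DOMSource(parse(xml)), result);
            return ((Document) result.getNode()).getDocumentElement().getNodeName();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootAfterTransform("<p>Hello</p>", null)); // prints p
        System.out.println(rootAfterTransform("<p>Hello</p>", XSL));  // prints block
    }
}
```

So the stylesheet only takes effect at the point where a Transformer is actually built from it.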
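and from the FO output, convert the resultant output to PDF and write it to a file:
byte[] pdfBytes = fo2PDF(foDoc, stylesheet);
if (pdfBytes == null) {
    // fo2PDF returns null when the conversion fails
    System.out.println("Error creating PDF: " + pdfFileName);
} else {
    // try-with-resources (Java 7+) closes the stream even if write() fails
    try (OutputStream pdf = new FileOutputStream(new File(pdfFileName))) {
        pdf.write(pdfBytes);
    } catch (java.io.FileNotFoundException e) {
        e.printStackTrace();
        System.out.println("Error creating PDF: " + pdfFileName);
    } catch (java.io.IOException e) {
        e.printStackTrace();
        System.out.println("Error writing PDF: " + pdfFileName);
    }
}
where pdfFileName is the filename of the PDF to create and stylesheet is the location of the stylesheet (xhtml2fo.xsl here).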
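Since fo2PDF hands back null when something goes wrong, it can be handy to sanity-check the bytes before writing them out. A minimal sketch, assuming nothing beyond the fact that PDF files start with the marker "%PDF-"; the helper name looksLikePdf is made up for this example and is not part of FOP.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class PdfBytesCheck {

    /** Returns true if data is non-null and begins with the "%PDF-" file marker. */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikePdf(null));                                                // prints false
        System.out.println(looksLikePdf("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)));    // prints true
    }
}
```

This only checks the header, so it catches a failed or empty conversion but says nothing about whether the rest of the file is valid.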
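The fo2PDF method looks like:
/**
 * Applies the stylesheet to the input and feeds the resulting XSL-FO to FOP.
 *
 * @param foDocument The input document
 * @param styleSheet Location (path or URI) of the stylesheet to apply
 *
 * @return byte[] PDF result, or null if the conversion failed
 */
private static byte[] fo2PDF(Document foDocument, String styleSheet) {
    FopFactory fopFactory = FopFactory.newInstance();
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
        Transformer transformer = getTransformer(styleSheet);
        Source src = new DOMSource(foDocument);
        // FOP receives the transform output as SAX events and renders the PDF into out
        Result res = new SAXResult(fop.getDefaultHandler());
        transformer.transform(src, res);
        return out.toByteArray();
    } catch (Exception ex) {
        // Report the failure instead of swallowing it silently
        ex.printStackTrace();
        return null;
    }
}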
/**
 * Create and return a Transformer for the specified stylesheet.
 * Returns null if the stylesheet cannot be loaded or compiled.
 *
 * Based on the DOM2DOM.java example in the Xalan distribution.
 */
private static Transformer getTransformer(String styleSheet) {
    try {
        TransformerFactory tFactory = TransformerFactory.newInstance();
        DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
        // XSLT stylesheets must be parsed with namespace support enabled
        dFactory.setNamespaceAware(true);
        DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
        // parse(String) treats its argument as a URI
        Document xslDoc = dBuilder.parse(styleSheet);
        DOMSource xslDomSource = new DOMSource(xslDoc);
        return tFactory.newTransformer(xslDomSource);
    } catch (javax.xml.transform.TransformerException
            | java.io.IOException
            | javax.xml.parsers.ParserConfigurationException
            | org.xml.sax.SAXException e) {
        // Multi-catch (Java 7+): all four failures are handled the same way
        e.printStackTrace();
        return null;
    }
}
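The part that ties the transform to FOP is the SAXResult: instead of building a tree or a string, the Transformer pushes its output as SAX events into whatever handler you give it. Here is a JDK-only sketch of that mechanism, with a small recording handler standing in for the handler FOP supplies. The names SaxResultDemo and elementsSeen are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
class SaxResultDemo {

    /** Runs an identity transform into a SAX handler and returns the element names it saw, in order. */
    static List<String> elementsSeen(String xml) {
        final List<String> names = new ArrayList<String>();
        // Stand-in for the handler a renderer would supply
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // SAX allows qName to be empty, so fall back to localName
                names.add(qName.isEmpty() ? localName : qName);
            }
        };
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.transform(new StreamSource(new StringReader(xml)), new SAXResult(handler));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(
            elementsSeen("<root><block>Hi</block><block>There</block></root>"));
    }
}
```

The upside of this design is that the FO document never has to be held in memory as a whole; the consumer sees each element as the transform produces it.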
-----
Give it a go and let me know what you think. Is there a better way that you know of? I am currently looking at Apache's PDFBox (see http://incubator.apache.org/pdfbox/) to extract the text and images from the output (in a print/layout friendly manner). Feedback is also welcome.
Cheerio
Chris
Posted at 11:15PM Mar 02, 2009 by Chris Fleischmann in Personal