Monday March 02, 2009
Converting HTML to PDF using open source APIs/Tools
Howdy folks, it's been a while since my last post, so I thought I would share some of the code I've been playing around with for converting HTML content/source files into PDF documents. There are various code snippets and examples around the Internet, but here are the steps I took, based on the searches and approaches I found. I wanted to collate it all in one place; it may save someone the hours I spent looking for ideas.
I am using a few different APIs for this:
1/ Apache FOP (see: http://xmlgraphics.apache.org/fop/) version 0.95
2/ The "xhtml2fo.xsl" stylesheet for the XHTML to XSL-FO translation, from here: http://webcoder.info/downloads/xhtml2fo.html
3/ HTMLCleaner version 2.1 from here: http://htmlcleaner.sourceforge.net/
The first step is converting the HTML to XHTML and returning its DOM representation, like so:
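// Note: getBytes() with no argument uses the platform default charset
ByteArrayInputStream input = new ByteArrayInputStream(html.getBytes());
final HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
DomSerializer doms = new DomSerializer(props, true);
org.w3c.dom.Document xmlDoc = null;
try {
    // Parse the (possibly malformed) HTML into HtmlCleaner's tag tree
    TagNode node = cleaner.clean(input);
    // Serialize the cleaned tree into a W3C DOM Document
    xmlDoc = doms.createDOM(node);
} catch (Exception e) {
    e.printStackTrace();
}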
where html is a String holding the HTML source.
Then, using the DOM, convert the XML data into its FO representation:
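If the HTML lives in a file rather than in memory, something like the following could be used to load it into the html String first. This is only a minimal sketch using the JDK: the class name HtmlFileLoader, the method readHtml and the input.html path are made up for illustration, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HtmlFileLoader {

    // Reads the whole file and decodes it as UTF-8 (an assumption; use the
    // file's real encoding if it differs).
    static String readHtml(Path file) throws IOException {
        byte[] bytes = Files.readAllBytes(file);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "input.html" is a placeholder path used when no argument is given
        Path file = Paths.get(args.length > 0 ? args[0] : "input.html");
        String html = readHtml(file);
        System.out.println(html.length() + " characters read from " + file);
    }
}
```

The resulting String can then be used as the html variable in the snippet above.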
org.w3c.dom.Document foDoc = null;
try {
    foDoc = xml2FO(xmlDoc);
} catch (Exception e) {
    System.out.println("ERROR: " + e.getMessage());
    e.printStackTrace();
}
The xml2FO method looks like this:
/*
 * Copies the input through a transformer and returns the result as a DOM.
 * Note: no stylesheet is passed to newTransformer(), so this is an identity
 * transform; the stylesheet is applied later, in fo2PDF.
 *
 * @param xml The xml input Document
 *
 * @return Document Result of the transform, or null if the transform fails
 */
private static Document xml2FO(Document xml) throws Exception {
    DOMSource xmlDomSource = new DOMSource(xml);
    DOMResult domResult = new DOMResult();
    TransformerFactory factory = TransformerFactory.newInstance();
    // newTransformer() throws on failure rather than returning null
    Transformer transformer = factory.newTransformer();
    try {
        transformer.transform(xmlDomSource, domResult);
    } catch (javax.xml.transform.TransformerException e) {
        e.printStackTrace();
        return null;
    }
    return (Document) domResult.getNode();
}
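One thing worth knowing about the method above: TransformerFactory.newTransformer() with no arguments gives an identity transformer, which copies its input unchanged, while newTransformer(Source) compiles a stylesheet and applies it. The sketch below shows the difference using only the JDK. The class TransformerKindsDemo, its method names and the toy stylesheet are assumptions for illustration; the toy stylesheet is not xhtml2fo.xsl and only handles p elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TransformerKindsDemo {

    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Toy stylesheet (hypothetical): wraps every <p> in an fo:block.
    static final String TOY_XSL =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:fo='" + FO_NS + "'>"
        + "<xsl:template match='/'>"
        + "<fo:root>"
        + "<fo:layout-master-set>"
        + "<fo:simple-page-master master-name='page'><fo:region-body/></fo:simple-page-master>"
        + "</fo:layout-master-set>"
        + "<fo:page-sequence master-reference='page'>"
        + "<fo:flow flow-name='xsl-region-body'><xsl:apply-templates select='//p'/></fo:flow>"
        + "</fo:page-sequence>"
        + "</fo:root>"
        + "</xsl:template>"
        + "<xsl:template match='p'><fo:block><xsl:value-of select='.'/></fo:block></xsl:template>"
        + "</xsl:stylesheet>";

    // Parses an XML string into a namespace-aware DOM Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Runs the given transformer DOM to DOM.
    static Document transform(Document in, Transformer t) throws Exception {
        DOMResult out = new DOMResult();
        t.transform(new DOMSource(in), out);
        return (Document) out.getNode();
    }

    // No stylesheet: the output has the same structure as the input.
    static Document identityCopy(Document in) throws Exception {
        return transform(in, TransformerFactory.newInstance().newTransformer());
    }

    // With a stylesheet: the output is whatever the stylesheet produces.
    static Document toFo(Document in) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(TOY_XSL)));
        return transform(in, t);
    }

    public static void main(String[] args) throws Exception {
        Document in = parse("<html><body><p>Hello</p><p>World</p></body></html>");
        System.out.println("identity root:   " + identityCopy(in).getDocumentElement().getNodeName());
        System.out.println("stylesheet root: " + toFo(in).getDocumentElement().getNodeName());
    }
}
```

So in the pipeline described here, the step that actually turns XHTML into XSL-FO is whichever one is handed the stylesheet.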
Then, from the FO output, convert the result to PDF:
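try {
    byte[] pdfBytes = fo2PDF(foDoc, stylesheet);
    if (pdfBytes == null) {
        // fo2PDF returns null when the conversion fails
        System.out.println("Error creating PDF: " + pdfFileName);
    } else {
        // try-with-resources closes the stream even if write() fails
        try (OutputStream pdf = new FileOutputStream(new File(pdfFileName))) {
            pdf.write(pdfBytes);
        }
    }
} catch (java.io.FileNotFoundException e) {
    e.printStackTrace();
    System.out.println("Error creating PDF: " + pdfFileName);
} catch (java.io.IOException e) {
    e.printStackTrace();
    System.out.println("Error writing PDF: " + pdfFileName);
}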
where pdfFileName is the filename of the PDF to create and stylesheet is the location of the XSL stylesheet (xhtml2fo.xsl).
The fo2PDF method looks like this:
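/*
 * Transforms the input document with the given stylesheet and streams the
 * result into FOP, which renders the PDF.
 *
 * @param foDocument The input Document
 * @param styleSheet Location of the XSL stylesheet to apply
 *
 * @return byte[] PDF result, or null on failure
 */
private static byte[] fo2PDF(Document foDocument, String styleSheet) {
    FopFactory fopFactory = FopFactory.newInstance();
    try {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
        Transformer transformer = getTransformer(styleSheet);
        Source src = new DOMSource(foDocument);
        // The transformer output is sent to FOP as SAX events
        Result res = new SAXResult(fop.getDefaultHandler());
        transformer.transform(src, res);
        return out.toByteArray();
    } catch (Exception ex) {
        // Report the failure instead of swallowing it silently
        ex.printStackTrace();
        return null;
    }
}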
/*
 * Create and return a Transformer for the specified stylesheet.
 *
 * Based on the DOM2DOM.java example in the Xalan distribution.
 */
private static Transformer getTransformer(String styleSheet) {
    try {
        TransformerFactory tFactory = TransformerFactory.newInstance();
        DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
        dFactory.setNamespaceAware(true);
        DocumentBuilder dBuilder = dFactory.newDocumentBuilder();
        Document xslDoc = dBuilder.parse(styleSheet);
        DOMSource xslDomSource = new DOMSource(xslDoc);
        return tFactory.newTransformer(xslDomSource);
    } catch (javax.xml.transform.TransformerException e) {
        e.printStackTrace();
        return null;
    } catch (java.io.IOException e) {
        e.printStackTrace();
        return null;
    } catch (javax.xml.parsers.ParserConfigurationException e) {
        e.printStackTrace();
        return null;
    } catch (org.xml.sax.SAXException e) {
        e.printStackTrace();
        return null;
    }
}
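The SAXResult used in fo2PDF means the transformer delivers its output as SAX events to whatever handler it is given, which in the code above is the one FOP returns from getDefaultHandler(). Here is a rough JDK-only sketch of that mechanism. The SaxResultDemo class and its CountingHandler are hypothetical stand-ins; the handler only counts start tags and has nothing to do with FOP itself.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxResultDemo {

    // Stand-in for the handler FOP supplies: it just counts start tags.
    static class CountingHandler extends DefaultHandler {
        int elements = 0;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements++;
        }
    }

    // Pushes the XML through an identity transformer into the handler and
    // returns how many elements the handler saw.
    static int countElements(String xml) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        CountingHandler handler = new CountingHandler();
        identity.transform(new StreamSource(new StringReader(xml)), new SAXResult(handler));
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c><d/></c></a>") + " elements seen");
    }
}
```

Swapping the counting handler for FOP's handler is what lets the transform output flow into the PDF renderer.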
-----
Give it a go and let me know what you think. Is there a better way that you know of? I am currently looking at Apache's PDFBox (see http://incubator.apache.org/pdfbox/) to extract the text and images from the output (in a print/layout-friendly manner). Feedback also welcome.
Cheerio
Chris
Posted at 11:15PM Mar 02, 2009 by Chris Fleischmann in Personal | Comments[8]
Hi,
interesting article. I couldn't access the xsl you externally link to however.
The link returns a 404.
Posted by Thijs Volders on March 02, 2009 at 11:57 PM EST #
Try this link from here: http://webcoder.info/downloads/xhtml2fo.html I have also updated the blog with a different link
Posted by Chris Fleischmann on March 03, 2009 at 07:11 AM EST #
Nice work! It's not an easy job stitching together two or three different APIs to perform some task.
I was wondering about all the library jars that come with Fop. So far I have found the absolutely critical ones to be:
fop.jar
xmlgraphics-commons-1.3.1.jar
commons-logging-1.0.4.jar
avalon-framework-4.2.0.jar
batik-all-1.7.jar
commons-io-1.3.1.jar
These are all in my classpath and these are what I needed to get a simple example to successfully run through.
For my first attempt I ignored the remaining lib jars but Fop is noisy in its output, so I'll be looking at the other jars in due time.
Posted by Susan on May 07, 2009 at 09:30 AM EST #
hi there
After a long search I found this post, which is very useful to me. When I try to compile the code above I get an error like "FopFactory cannot be resolved". Can anyone help me out?
Posted by Gobinath Palanisamy on August 03, 2009 at 08:36 AM EST #
This works very well, nice job! I have a question: is there a way for me to feed in a CSS file? The HTML code that I have contains some styles referencing an external CSS file. Please advise.
Thanks,
Posted by ken on August 12, 2009 at 11:53 AM EST #
Hi,
When I run the code with the mentioned xsl file, I get the exception:
javax.xml.transform.TransformerException: gnu.xml.dom.DomEx: Parameter or operation isn't supported by this node.
Did anybody encounter this error?
Thanks,
Alain
Posted by Alain on August 24, 2009 at 08:06 PM EST #
It seems like you have mixed-up transformers here! You cannot use current xml2FO to convert [xml] -> [fo] without a style sheet. So, you need to modify xml2FO and fo2PDF and swap their transformers.
Posted by dmilic on September 25, 2009 at 12:57 AM EST #
Hi, I am not getting how to give an HTML file as an input file. Please help me.
Posted by sreekar on November 20, 2009 at 09:52 PM EST #