Wednesday, 01 Aug 2007
Ok, now that we've let the cat out of the bag, my inbox is filling up with mails asking for more information on the PDF import filter we're going to implement. So, I'd like to give you the details that are known so far, all of which are still open for discussion if somebody comes up with a better idea:
As already mentioned in my comment on the initial blog entry, importing the PDF content into a Writer document, with its flowing text and therefore flowing layout, is not an option for us. So, we decided to write a filter that imports the PDF content as an OOo Draw/Impress document.
With this solution, we get the full benefit of a page-oriented, fixed layout. All graphical elements will sit at the fixed positions given in the PDF file, and text portions will be combined as far as possible into text shapes, ensuring that they keep their exact positions but remain editable by the user.
The challenge with this solution is 'just' to find the common bounding box for text portions that can be grouped together in one text shape. But this is nothing compared to the 'impossible', lifelong task of reconstructing or guessing the whole layout of the original document the PDF was created from. As you know, PDF files generally don't contain such structural information, apart from some tagged PDF files, on which we can't rely.
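To make the bounding box idea a bit more concrete, here is a minimal Python sketch of how nearby text portions could be grouped into one text shape. It is not the filter's actual algorithm: the names `Fragment`, `TextShape`, `group_fragments` and the `max_gap` threshold are made up for illustration, and it assumes that each text portion already comes with a position and size, with y growing downwards.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fragment:
    """One positioned text portion (hypothetical structure)."""
    text: str
    x: float       # left edge
    y: float       # top edge; y grows downwards in this sketch
    width: float
    height: float


@dataclass
class TextShape:
    """A group of fragments that would end up in one editable text shape."""
    fragments: List[Fragment]

    @property
    def bbox(self) -> Tuple[float, float, float, float]:
        # Common bounding box (left, top, right, bottom) of all fragments.
        left = min(f.x for f in self.fragments)
        top = min(f.y for f in self.fragments)
        right = max(f.x + f.width for f in self.fragments)
        bottom = max(f.y + f.height for f in self.fragments)
        return (left, top, right, bottom)

    @property
    def text(self) -> str:
        # Fragments are stored in reading order (top to bottom, left to right).
        return " ".join(f.text for f in self.fragments)


def _is_close(shape: TextShape, frag: Fragment, max_gap: float) -> bool:
    # Grow the shape's box by max_gap on every side and test for overlap.
    left, top, right, bottom = shape.bbox
    return (frag.x <= right + max_gap and frag.x + frag.width >= left - max_gap
            and frag.y <= bottom + max_gap and frag.y + frag.height >= top - max_gap)


def group_fragments(fragments: List[Fragment], max_gap: float = 3.0) -> List[TextShape]:
    """Greedy single pass: attach each fragment to the first shape it is close to."""
    shapes: List[TextShape] = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        for shape in shapes:
            if _is_close(shape, frag, max_gap):
                shape.fragments.append(frag)
                break
        else:
            shapes.append(TextShape([frag]))
    return shapes


if __name__ == "__main__":
    page = [
        Fragment("Hello", 10, 10, 30, 12),
        Fragment("world", 42, 10, 30, 12),
        Fragment("second line", 10, 24, 60, 12),
        Fragment("Footer", 10, 700, 40, 12),
    ]
    for shape in group_fragments(page):
        print(shape.bbox, repr(shape.text))
```

A single greedy pass like this never merges two shapes that only become neighbours later on, and it ignores fonts, columns and rotation, so a real implementation would need considerably more care.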
The next question that arises for development is which kind of parser to use for reading the basic content of the PDF file. There is a well-known and widely used framework for this: the XPDF library and its derivatives like Poppler. Yeah, that would be a great and well-tested framework for us, but unfortunately, it isn't compatible with the OOo code licensing, at least at the moment. So, we'll have to write our own parser for this task, which is not bad at all, since XPDF still lacks some features we would have to implement in either case.
The filter itself will be available as a downloadable extension to the standard OOo release. This fits perfectly into our roadmap to create a more modular OOo package, consisting of several 'standalone' components that are reusable in other contexts.
The most interesting question that came up is that of the timeline for this implementation. Please expect the product version of the filter to be ready for the OOo 3.0 release at the latest. A detailed release plan for OOo 3.0 is not known at the moment. But, as already mentioned, I expect to have first results available within a few months, so that most of you will be able to enjoy playing around with a pre-release of this filter by the end of this year. We will definitely need your feedback on this first release and the upcoming ones to add missing parts, fix bugs etc.
Some of you asked if there will be some additional goodies around the whole PDF story in OOo. The answer to this question is 'Yes, there will be some more stuff around the pure import and export filters'. One example of this would be the support for PDF/A, a feature that is currently being implemented by community member Giuseppe Castagno.
Another example would be the support for creating PDF documents that contain the original ODF document itself, allowing any ODF-enabled application to read the original content without loss.
I hope that this blog entry answers the most urgent questions for the moment. Please don't hesitate to add any comments, questions, suggestions etc. you have.
Comments
Posted by Jörg Weidinger on August 01, 2007 at 07:23 PM CEST #
Posted by Mike on August 02, 2007 at 07:09 AM CEST #
Joerg: I'd really like to find a solution with XPDF/Poppler and don't have any interest in writing a new parser for this task. I hope there will be a way to use it, but with the current licensing models, it's just not possible, apart from an 'out-of-process' solution with a little GPL part around it for IPC usage. Since this would be a nasty solution, I don't want to go this way.
Mike: Right. The scenario you're mentioning was also discussed in our meetings, and we did consider it a valid scenario for many users. In a perfect world, one would expect to get the PDF document opened in Writer with the same layout as in Acrobat Reader but with full editing capabilities. Since this is not possible for the reasons I already described, there are many possible, but always inferior, solutions to address this. We decided to go the layout-preserving way first, which is only possible with a page-oriented and position-preserving solution. You'll still be able to extract the textual content, since this will be included in the Draw/Impress document as editable/copyable text shapes. This also doesn't mean that there won't be a second or third solution that exactly fits your needs and those of others. It will be a really small effort to convert the created Draw/Impress document to a Writer document containing just the text, using an XSLT for example. So, there will be further add-ons available that will perfectly match your request. It's just a matter of priorities.
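As a rough illustration of how small that text-only step could be, here is a minimal sketch that uses Python's standard library instead of XSLT. It assumes the imported text shapes are stored as `draw:text-box` elements containing `text:p` paragraphs in the document's `content.xml`; how the filter will really write them is not decided here. `extract_text_shapes` and `SAMPLE` are illustrative names only.

```python
import xml.etree.ElementTree as ET

# ODF namespace URIs for the drawing and text vocabularies.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_text_shapes(content_xml: str) -> list:
    """Return the text of every paragraph found inside a draw:text-box."""
    root = ET.fromstring(content_xml)
    paragraphs = []
    for box in root.iter("{%s}text-box" % DRAW_NS):
        for para in box.iter("{%s}p" % TEXT_NS):
            # itertext() also collects text nested in spans.
            paragraphs.append("".join(para.itertext()))
    return paragraphs


# Tiny hand-written stand-in for a Draw document's content.xml (an assumption,
# not real filter output).
SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
    '<office:body><office:drawing><draw:page>'
    '<draw:frame><draw:text-box>'
    '<text:p>First <text:span>shape</text:span></text:p>'
    '</draw:text-box></draw:frame>'
    '<draw:frame><draw:text-box>'
    '<text:p>Second shape</text:p>'
    '</draw:text-box></draw:frame>'
    '</draw:page></office:drawing></office:body>'
    '</office:document-content>'
)

if __name__ == "__main__":
    for line in extract_text_shapes(SAMPLE):
        print(line)
```

For a real file, the `content.xml` stream would first have to be read out of the document's zip package, for example with Python's `zipfile` module.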
Posted by Kai Ahrens on August 02, 2007 at 10:09 AM CEST #
Posted by Chris Puttick on August 02, 2007 at 09:02 PM CEST #
Posted by Mike on August 03, 2007 at 09:01 AM CEST #
Posted by Patrick on August 03, 2007 at 10:02 AM CEST #
Posted by Jitin on August 04, 2007 at 09:56 PM CEST #
Posted by Len Umina on August 05, 2007 at 02:43 AM CEST #
Big plans, but why no attention to fundamentals like the current inability to use OTF fonts with PostScript outlines...
Posted by Aramis on August 07, 2007 at 06:04 PM CEST #
How do you plan to deal with fonts? What if the original PDF embeds only part of the typeface, but the user edits the text and adds new characters?
Anyway, even without complete font-preservation features, such an import capability would be quite interesting for me! Can't wait to test it.
Posted by Benjamin on August 07, 2007 at 11:01 PM CEST #
As Aramis notes, it is true that fonts do not embed properly when exporting to a PDF in Writer. The only workaround I know of is to print to a program such as the open source PDFCreator (http://www.pdfforge.org/products/pdfcreator). PDFCreator sets itself up as one of your printers and converts everything you send to it into a nice PDF file. It also has all the security features you expect. So far, this has proved a reliable workaround for me.
Posted by Odisseo on August 10, 2007 at 04:32 AM CEST #