GullFOSS
OpenOffice.org Engineering at Sun
 
 
 
 
Wednesday, 01 Aug 2007
Completing PDF support in OOo (Part II)
Kai Ahrens

OK, now that the cat is out of the bag, my inbox is filling up with mails asking for more information on the PDF import filter we're going to implement. So I'd like to share the details known so far; they are still open for discussion if somebody comes up with a better idea:

  • As already mentioned in my comment on the initial blog entry, importing the PDF content into a Writer document with flowing text, and thus a flowing layout, is not an option for us. So we decided to write a filter that imports the PDF content as an OOo Draw/Impress document.
    With this solution, we get the full benefit of a page-oriented, fixed layout. All graphical elements will sit at the fixed positions given in the PDF file, and text portions will be combined as far as possible and anchored in text shapes, so that they keep their exact positions but remain editable by the user.
    The challenge with this solution is 'just' to find a common bounding box for text portions that can be grouped together in one text shape. But this is nothing compared to the 'impossible', lifetime task of reconstructing or guessing the whole layout of the original document the PDF was created from. As you know, PDF files generally don't contain such structural information; some tagged PDF files do, but we can't rely on them.

  • The next question for development is which parser to use for reading the basic content of the PDF file. There is a well-known and widely used framework for this: the XPDF library and its derivatives such as Poppler. That would be a great, well-tested framework for us, but unfortunately its license is not compatible with the OOo code licensing, at least at the moment. So we'll have to write our own parser for this task, which is not bad at all, since XPDF still lacks some features we would have to implement in either case.

  • The filter itself will be available as a downloadable extension to the standard OOo release. This fits perfectly into our roadmap to create a more modular OOo package, consisting of several 'standalone' components that can be reused in other contexts.

  • The most interesting question that came up concerns the timeline for this implementation. Please expect the product version of the filter to be ready for the OOo 3.0 release at the latest. A detailed release plan for OOo 3.0 is not known at the moment. But, as already mentioned, I expect to have first results available within a few months, so that most of you will be able to play around with a pre-release of this filter by the end of this year. We will definitely need your feedback on this first release and the following ones to add missing parts, fix bugs, etc.

  • Some of you asked whether there will be additional goodies around the whole PDF story in OOo. The answer is 'Yes, there will be more than just the import and export filters'. One example is support for PDF/A, a feature that is currently being implemented by community member Giuseppe Castagno.
    Another example is support for creating PDF documents that contain the original ODF document itself, so that any ODF-enabled application can read the original content without loss.
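
To make the bounding-box grouping from the first point a bit more concrete, here is a minimal sketch in Python. It is not the filter's actual algorithm, only an illustration under stated assumptions: text portions arrive as axis-aligned rectangles in page coordinates, and two portions belong to the same text shape when their boxes touch or come within a small gap of each other. The names Box, TextShape, group_portions and the gap value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def union(self, other: "Box") -> "Box":
        # Smallest box that contains both rectangles.
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))

    def near(self, other: "Box", gap: float) -> bool:
        # True if the boxes overlap or lie within `gap` of each other.
        return (self.x0 - gap <= other.x1 and other.x0 - gap <= self.x1 and
                self.y0 - gap <= other.y1 and other.y0 - gap <= self.y1)


@dataclass
class TextShape:
    """One editable text shape: a common bounding box plus its portions."""
    box: Box
    portions: List[str]


def group_portions(portions: List[Tuple[Box, str]], gap: float = 2.0) -> List[TextShape]:
    """Greedily merge nearby text portions into text shapes.

    Assumption: closeness of bounding boxes is the only grouping criterion.
    A real filter would also look at fonts, baselines and reading order.
    """
    shapes: List[TextShape] = []
    # Process portions top to bottom, then left to right.
    for box, text in sorted(portions, key=lambda p: (p[0].y0, p[0].x0)):
        close = [s for s in shapes if s.box.near(box, gap)]
        shapes = [s for s in shapes if not s.box.near(box, gap)]
        merged_box = box
        texts: List[str] = []
        for s in close:
            merged_box = merged_box.union(s.box)
            texts.extend(s.portions)
        texts.append(text)
        shapes.append(TextShape(merged_box, texts))
    return shapes


portions = [
    (Box(10, 10, 40, 20), "Hello"),
    (Box(41, 10, 70, 20), "world"),
    (Box(200, 200, 230, 210), "Far"),
]
shapes = group_portions(portions)
for shape in shapes:
    print(shape.box, shape.portions)
```

Note that this is a single greedy pass: a merged box is not re-checked against the shapes that already exist, so a production implementation would probably repeat the merge until nothing changes.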

I hope that this blog entry answers the most urgent questions for the moment. Please don't hesitate to add any comments, questions, suggestions etc. you have.



Posted by Kai Ahrens on 01 Aug 2007  |  Comments[12]

Comments

Jörg Weidinger said: Regarding XPDF and/or Poppler, I hope to see some kind of agreement on the licence issue. It's "sad" to see that open source software can't deliver one of its often-cited benefits (meaning the benefit of "sharing and progressing together"). There are other projects which would directly benefit from such collaboration. A short list can be found here: http://freedesktop.org/wiki/Software/poppler . Abiword and KDE are not mentioned there. However, the PDF import feature in OOo is exciting news to me.

Posted by Jörg Weidinger on August 01, 2007 at 07:23 PM CEST #

Mike said: Importing PDF as a Draw/Impress document is not the feature users expected. Most of them need to import only the text (perhaps with images and tables) and ignore the original PDF formatting. Transforming a PDF into a set of Draw "images" or slides is almost meaningless in this case.

Posted by Mike on August 02, 2007 at 07:09 AM CEST #

Kai Ahrens said:

Joerg: I'd really like to find a solution with XPDF/Poppler and don't have any interest in writing a new parser for this task. I hope there will be a way to use it, but with the current licensing models, it's just not possible, excluding an 'Out Of Process' solution with a little GPL part around it for IPC usage. Since this would be a nasty solution, I don't want to go this way.

Mike: Right. The scenario you're mentioning was also discussed in our meetings, and we did consider it a valid scenario for many users. In a perfect world, one would expect to get the PDF document opened in Writer with the same layout as in Acrobat Reader but with full editing capabilities. Since this is not possible for the reasons I already described, there are many possible, but always inferior, solutions to address this. We decided to go the layout-preserving way first, which is only possible with a page-oriented and position-preserving solution. You'll still be able to extract the textual content, since it will be included in the Draw/Impress document as editable/copyable text shapes. This also doesn't mean that there won't be a second or third solution that exactly fits your needs and those of others. It will be a really small effort to convert the created Draw/Impress document to a Writer document containing just the text, using an XSLT for example. So there will be further add-ons available that will perfectly match your request. It's just a matter of priorities.

Posted by Kai Ahrens on August 02, 2007 at 10:09 AM CEST #

Chris Puttick said: @Mike: surely the PDF import facility you describe already exists? Select all, cut and paste. That gets the text and graphical contents without preserving formatting (of any kind!). On the other hand, being able to preserve formatting, whether in a word processor or "just" in a graphical editor/presentation tool, is as near to Adobe Professional as you can get. Gets my vote...

Posted by Chris Puttick on August 02, 2007 at 09:02 PM CEST #

Mike said: I've just been considering the usefulness of the filter. Let's suppose it works like a charm. What can the user do with it? Import a PDF, edit something, and (ideally) export back into PDF. Well, this ability alone is a nice addition for OOo, and I would be happy to hear about its other useful applications. Extracting text data from a presentation will be even more difficult than from PDF viewers (Adobe Reader, KPDF), because each page needs to be processed separately. I'm not against the proposed implementation, but the ability to import only the data and strip most of the formatting would greatly enhance the filter's usability. PS. My experience with PDFs shows that more than 90% of them are "text" docs.

Posted by Mike on August 03, 2007 at 09:01 AM CEST #

Patrick said: Will the PDF/A feature be released before the release of the import filter? IMHO, this feature is much more essential for OOo.

Posted by Patrick on August 03, 2007 at 10:02 AM CEST #

Jitin said: It is a great option, but will it be able to import all the formatting into Open Office?

Posted by Jitin on August 04, 2007 at 09:56 PM CEST #

Len Umina said: I'd really like a complete replacement for Acrobat of course, but in Open Office my biggest problem is having to import PDF documents into another document, either as a chapter, for reference, or just to include them. For example, I have a huge library of PDF documents created with Adobe Acrobat. I'm writing a letter and I want to include a 5-page PDF in the middle because it's pertinent. I want to make sure the communication includes the entire transfer, and I often want the page headings etc. from Open Office to surround it on the same page. What I do now is go to Acrobat, export JPG images, and then import these images one at a time. It's very slow, but it does the job. I have not found a better way (if you know one, please let me know). Now my PDF file pages are inside boxes that can be scaled to fit the page. The application is simple: legal stuff. Motions, exhibits and analysis documents are very nicely referenced and recorded this way. Sometimes I'd just like to have a PDF document print inline with an Open Office document and just be able to superimpose a page number string on it. Another application is document production. If you need to produce 10,000 pages from PDF files, it can get really difficult, having to put Bates numbers on each page etc. So far I've had to write Perl scripts to do the job, but there is no reason at all that Open Office, via a master document, shouldn't be able to do this. Then it would be very clear what the submitted document list was, it could be easily reproduced, and you wouldn't need a software guru to pull it all together. /Len

Posted by Len Umina on August 05, 2007 at 02:43 AM CEST #

Aramis said:

Big plans, but why no attention to fundamentals like current inability to use OTF fonts with PostScript outlines...

Posted by Aramis on August 07, 2007 at 06:04 PM CEST #

Benjamin said:

How do you plan to deal with fonts? What if the original PDF embeds only part of the typeface, but the user edits the text and adds new characters?
Anyway, even without complete font-preservation features, such an import capability would be quite interesting for me! Can't wait to test it.

Posted by Benjamin on August 07, 2007 at 11:01 PM CEST #

Odisseo said:

As Aramis notes, it is true that fonts do not embed properly when exporting to a PDF in Writer. The only workaround I know of is to print to a program such as the open source PDFCreator (http://www.pdfforge.org/products/pdfcreator). PDFCreator sets itself up as one of your printers and converts everything you send to it into a nice PDF file. It also has all the security features you expect. So far, this has proved a reliable workaround for me.

Posted by Odisseo on August 10, 2007 at 04:32 AM CEST #

