GullFOSS
OpenOffice.org Engineering at Sun
 
 
 
 
More Flickr photos tagged with openoffice

Today's Page Hits: 2881

Locations of visitors to this page
Monday, 20 Jul 2009
XML Performance, and now for something completely different...
Christian Lippka
While Armin Le Grand did a great job at improving the load/save Performance for presentation documents by tweaking the application core itself, I took a step back and thought about performance improvement by using different technologies. The first step was to look at other techniques to deal with xml documents that would have one or more advantages over the current implementation. Since according to Helmuth von Moltke "No battle plan survives contact with the enemy" I had to test my assumptions and so my theoretical work on this resulted in a  prototype import filter implementation for impress. This is a short  summary of what techniques I looked at, why and how I used them in the prototype and the interesting results I got from it.

Mission Statement

The mission of this prototype was to gather data how the utilization of new technology could enhance the performance of the native OpenDocument Format (ODF) filters for OpenOffice.org (OOo). The focus of this prototype was to first look at the import of impress documents and achieve the following three goals

Goal 1 : improve overall filter performance

Goal 2 : make use of multiple cores/cpu through threading (where threading is not blocked constantly by calls to the OOo core which is currently not supporting multiple threads without blocking)

Goal 3 : support the implementation of load on demand (for example to display the first slides to the user and load the rest of the document in the background)

Current state

Currently ODF is imported with a SAX based filter that is accessed over the UNO API. Therefore the current SAX parser pushes notifications about the xml elements to the current ODF filter implementation in the order they appear in the xml stream. The filter itself has no control on choosing which elements to parse first or to postpone the parsing of the current element for a later time. It makes it also nearly impossible to measure the time spend for the current xml parsing because this tight coupling between the SAX parser and the ODF filter.

XML streams from ODF documents are usually encoded as UTF-8. The UNO API uses strings in UTF-16 encoding. Therefore, the SAX parser converts all strings from the xml stream to UTF-16 which is used by the UNO string implementation. To identify xml element and attribute names, expensive string compares are conducted.

Between the ODF filter and the current OOo application core is another UNO API layer. The filter has to convert the xml events to something  that can be send over this layer to the OOo application core. In most cases the implementation of the API layer must also convert the given data to the format used in the applications core.

the detailed flow of data is currently as follows
  1. SAX parser reads xml data of utf-8 streams inside zip storages
  2. SAX parser UNO implementation handles namespaces and converts element names, attribute names, attribute values and text content to utf-16 strings and feeds the ODF filter implementation
  3. ODF filter implementation transforms utf-16 xml data to UNO representations for the OOo UNO API
  4. The applications UNO API implementation transforms the UNO data to a core data representation

Assumptions

Alternative technology

XML Pull Parser (XPP)

XPP is a streaming pull XML parser. Unlike sax where the sax parser calls the filter, the filter itself makes calls to the XPP parser to parse the next xml element.

+ In contrast to sax, the filter can interrupt sax parsing after any element and continue parsing later.
+ Pull instead of Push leads to a cleaner filter implementation (cleaner code is usually easier to service and improve).

- Performance is equal to a sax parser
- No random access

Document Object Model (DOM)

A DOM is a parsed memory representation of a xml stream. This technology is used by modern browsers and the odftoolkit.org project.

+ Filter has random access to all xml elements
+ Random access leads to a filter implementation (cleaner code is usualy easier to service and improve).

- A DOM has to store a complete copy of the xml stream + management in memory during the filter process.

Fast sax parser

I developed the fast sax parser during the initial implementation of the Office 12 XML filters. It is basically a sax parser but uses integer tokens to represent known namespaces, element names and attribute names. Tools like the gnu gperf can be used to create perfect hash code to transform the xml names to integer tokens. Scripts can parse the dtd or relax ng of an xml format to automatically extract all the xml names that needs to be convertable to integers. With utf-8 xml streams, the tokens can be created without the need to transform the strings to another encoding first. Namespaces can be combined with xml names so each element or attribute name with a namespace can be identified with just one integer compare.

+ Reduces string handling (encoding, comparing, storing)
+ Leads to cleaner filter implementation (f.e. switch statements can be used to identify child elements instead of if .. else .. blocks which use the string compare functions)
+ Usage of perfect hash algorithms which are automatically created during compile time

- No random access

The Prototype

Since random access looks like the key to have a filter that supports painless load on demand, I decided to go with a DOM solutions. To minimize the memory footprint of the DOM tree, I used the fast sax parser to build the DOM tree and use the integer tokens for xml names instead of the strings from the stream. Now parsing the xml  stream itself does not need any interaction with the application core so this could be done in a separate thread. Converting the xml representation of attribute values to an UNO  representation is also something that could be done in almost all cases without the application core, so this should be done in the same thread.

The problem here is that a classic DOM is a generic and typeless representation of the xml data. The solution here is to use a technique we introduced in the odftoolkit.org project. Initially a xslt transformation was used to create dom node implementations for each element of the ODF format with type safe access to its attributes using
only the relax ng. Upkeeping xslt templates proved costly for such complex operations. So this was replaced with a code generator that I implemented in java which uses simple templates and configuration files to transform a relax ng schema to code files.

This code generator also allowed to create DOM tree element implementations for other languages than java which is used in the  odftoolkit.org project. Therefore I used the generator to create c++ source files for all ODF elements and I adapted the configuration files to create types for the attributes that are equal to the UNO types that the filter needs to pass to the application core. (The key difference between the ODFDOM from odftoolkit.org and the DOM tree for the prototype is that the former uses ODF based types for the attributes and the later uses UNO types).

So in conclusion, a DOM tree builder is started in a worker thread and parses the xml stream by using the fast sax parser (The prototype actually starts two worker threads, one to build the tree for the styles.xml stream and one for the content.xml stream). It uses the sax events to create a tree where each element is the instance of a class that was specifically generated for that element from the relax ng schema. All attribute values are parsed and stored into an UNO Any with preferable the same type as used in the UNO API of the OOo application. So for example if the attribute is a length value then something like "12cm" is converted into an UNO Any containing an integer value of 1200 (12cm converted to 1/100th mm).

This worker threads would never block or wait for the OOo thread. But this would not make sense if at the same time the office thread idles and waits for the tree builder to finish. So the tree builder notifies the filter as soon as an imported element has been fully parsed. For example if the office:styles element is completely parsed the filter thread can start and import the styles. If the filter gets notified that a slide has been completely parsed, it can check if the needed styles and master pages are already imported and then import this slide. If the filter is also executed in a separate thread, the office thread can paint the already imported slides to enhance the responsiveness of the application to the user which results in a 'subjective' performance gain

To implement the filter I borrowed code from the existing ODF filter and transformed it to use the DOM tree instead of sax events. This mostly resulted in much less and cleaner code, as expected. For a reliable comparison with the existing filter I had to implement a minimum set of functionality so that for selected real world documents the prototype imports all the functionality available in that document.

I ended up implementing

Results

First I used an average real world document with 47 slides and lots of graphics and some chart ole. It showed that the prototype filter was around 2% faster than the original filter. This was less than expected.

Next I created an artificial document. A presentation with 188 slides and only formated text, no graphics, no ole. This pure xml document would be uncommon for a presentation but a good approximation what the gain could be for a writer or calc document where the xml to graphic/ole ratio is much higher. This lead to a performance gain of 10% which is not bad since the prototype is not yet profiled and optimized itself.

I tested this on a dual core 3Ghz system. Experiments with the original filter showed that we are cpu bound and since we use only one core, the processor usage is only 50%. So I expected better results with the threaded prototype by making use of the 3Ghz from the second core. A quick look at a cpu monitor showed that this didn't happen, processor usage was still capped at 50%.

This made me suspect that the actual parsing, tree building and xml to UNO conversion did only account for an insignificant amount of the time the filter needs to import this document. After removing the threading I measured the time it took to parse the xml streams and to actually import the document. It turns out that building the DOM tree for the styles.xml and content.xml accounted for less than 2% of the overal time. So when opening a document that takes 10 seconds to load, the second core is only used 0,2 seconds. Remember that this does not only include the actual xml parsing but also creating the DOM tree in memory and converting most of the attribute values to UNO types.

The interesting finding here is that the overhead from the xml parsing is much less than the typical 'developer gut feeling' about xml. So any performance work in the filter or the application core  would be much more efficient then trying to speed up the xml parsing.

Since the usage of DOM is often criticized for its memory consumption, I also took a look at this. While I replaced most strings with 32 bit integer tokens I figured that the memory consumption of a DOM tree should not be greater than the xml stream it was parsed from. My expectation was that for most cases it should be even smaller.

The first measuring showed that it was actually 10 times more than its xml stream. After further investigation I found the source of the problem which is the overhead for each allocated instance from the memory manager.After converting attributes from individual instances to a single vector this dropped to 2x the size of the xml stream. For an impress document this is not a problem as the xml stream size for average documents is seldom more then 1 MB. For a writer document this may also be no problem as for example the huge OpenDocument specification has less than 10MB of xml streams. For calc this may be a problem as calc document with many cells can have xml streams of 100MB and more. For current office  workstations this may not be a problem at all, but if OOo runs on a server for multiple users or if OOo would be ported to small devices then this could become an issue.

Conclusion

While the prototype showed that this method of implementing an ODF import filter does result in a performance gain and the option to support load on demand, for an impress application the performance gain of only around 2% for real world documents is out weighted by the actual cost to implement this filter. I currently estimate an effort to at least 6 man month for implementation of only the impress import filter (this is without time for testing which is also crucial to find regressions). For a calc and writer filter these figures may differ.

Since writer and calc are more xml centric then impress documents, a prototype for these applications may show that the overall gain still does out weight the costs.




tags:

Posted by Christian Lippka on 20 Jul 2009  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[8]

Wednesday, 19 Dec 2007
PDF Import: First milestone reached
Thorsten Behrens

We've now reached a first milestone with the PDF import extension, in that we're able to import typical PDF documents with good layout fidelity in Draw and Impress. Below, you can see a sample CAD pdf, imported in Draw, with the title slightly edited (it read "Airplane Engine" before):

Sample import screenshot

When exporting this back to pdf, or printing it, this enables users to perform basic editing in formerly read-only PDF. Equally possible is filling out forms (though not overly convenient, as of now - the native pdf forms are not yet imported). 

The next steps are probably improvements in the Writer import, which focuses more on editability, and less on layout. The legal issues with an external pdf parser mentioned before are solved, so everything necessary to check this out is now available in CVS, as CWS picom.

tags:

Posted by Thorsten Behrens on 19 Dec 2007  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[3]

Wednesday, 01 Aug 2007
Completing PDF support in OOo (Part II)
Kai Ahrens

Ok, now that we let the cat out of the bag, my inbox is filled with some mails asking for more information on the PDF import filter we're going to implement. So, I'd like to give you some details that are yet known, but still discussable if somebody comes up with a better idea:

  • As already mentioned in my comment regarding the initial blog entry, it won't be an option for us to import the PDF content into a Writer document containing floating text and as such a floating layout. So, we decided to write a filter that imports the PDF content as OOo Draw/Impress document.
    With this solution, we'll have the full benefit of a page orientated, fixed layout. All graphical elements will be at fixed positions given in the PDF file and text portions will be combined as most as possible to be anchored in text shapes, ensuring that text portions preserve their exactly given position, but are still editable by the user.
    The challenge with this solution is 'just' to find the most common bounding box for text portions that can be grouped together in one text shape. But this is nothing compared to the 'impossible' and life time task of reconstructing/guessing the whole layout of the original document the PDF document was created from. As you know, PDF files don't contain such structuring information in general, beside some tagged PDF files, on which we can't rely.

  • The next question that arises for development is, what kind of parser to use for reading the basic content of the PDF file. There exists a well known and widely used framework for this: the XPDF library and its derivatives like Poppler. Yeah, that would be a great and well tested framework for us, but unfortunately, it doesn't match with the OOo code licensing, at least at the moment. So, we'll have to write our own parser for this task, which is not bad at all due to the fact that XPDF still lacks some features we would have to implement in either case.

  • The filter itself will be available as a downloadable extension to the standard OOo release. This perfectly fits in our roadmap to create a more unitized OOo packet, consisting of several 'standalone' components, reusable in other context.

  • The most interesting question that came up is that of the timeline for this implementation. Please expect to have the product version of the filter ready for the OOo 3.0 release latest. A detailed release plan for OOo 3.0 is not known at the moment. But, as already mentioned, I expect to have first results available within a few months, so that most of you will be able to enjoy playing around with a pre-release of this filter till the end of this year. We will definitely need your feedback regarding this first release and upcoming ones to add missing parts, fix bugs etc.

  • Some of you asked, if there will be some additional goodies around the whole PDF story in OOo. The answer for this question is 'Yes, there will be some more stuff around the pure import and export filters'. One example for this would be the support for PDF/A, a feature that is currently implemented by community member Giuseppe Castagno.
    Another example would be the support for creating PDF documents containing the original ODF document itself, allowing to read the original content without loss by any ODF enabled application.

I hope that this blog entry answers the most urgent questions for the moment. Please don't hesitate to add any comments, questions, suggestions etc. you have.


tags:

Posted by Kai Ahrens on 01 Aug 2007  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[12]

Monday, 30 Jul 2007
Completing PDF support in OOo
Kai Ahrens

Having a very well working and mature PDF export filter in OOo for several years now, it's time to take the final steps regarding full PDF support. Yes, you're right, we're speaking of implementing a native PDF import filter for OOo within the Sun OOo Graphics development team.

As trivial as this task might look like at the moment, there are several topics that need to be discussed in detail before development can be started. This begins with the OOo application, that the import filter will be written for and definitely doesn't end with the appropriate parser that will be used to read the PDF content itself.

I don't want to go into details of the current planning and development phase by now, but please be assured that the final solution is planned to be a total replacement of the currently available tools you normally use in your everyday workflow, preserving the layout as good as possible plus offering editing capabilities for the imported document, a feature that you don't get for free with most of your common tools. Sounds great, doesn't it?

I don't want to be too optimistic, but we're planning for the first prototype to be available within the next few months. Please stay tuned for more details to be provided by the involved development team members within the next days...




tags:

Posted by Kai Ahrens on 30 Jul 2007  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[16]

Thursday, 26 Apr 2007
New MS Word Filter for Writer (Milestone 1)
Henning Brinkmann
We are announcing milestone 1 of the new MS Word Filter for StarOffice. It can be downloaded from here:

http://ooo.services.openoffice.org/pub/OpenOffice.org/cws/upload/writerfilter2/


Available platforms are:

  • Windows

  • Linux

  • Solaris(SPARC)


The following features are covered by this milestone:

  • reading text

  • reading character attributes

  • reading paragraph attributes

  • reading fields

  • reading styles


In the next milestone we plan to provide support for reading tables and section properties.


The following people are currently working on this filter:

  • Henning Brinkmann (Sun Microsystems)

  • Oliver Specht (Sun Microsystems)

  • Fridrich Strba (Novell)


Your feedback is welcome on this mailing list:

mailto:dev@sw.openoffice.org

You can subscribe to the mailing list here:

http://sw.openoffice.org/servlets/ProjectMailingListList


For more information about the filter project, please read this post:

The New Microsoft Word Filter for Writer

or visit the OpenOffice.org wiki page about the filter:

http://wiki.services.openoffice.org/wiki/WriterFilter



tags:

Posted by Henning Brinkmann on 26 Apr 2007  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[1]

Monday, 27 Nov 2006
The New Microsoft Word Filter for Writer
Henning Brinkmann

Let me introduce myself first. My name is Henning Brinkmann. I am working for Sun in the StarOffice/OpenOffice.org team in Hamburg, Germany.

My current responsibility is the Microsoft Word filter in Writer (the text processing application in OpenOffice.org). This includes maintaining the current filter as well as the filter component we are developing at the moment.

This weblog entry will give a brief overview of the new filter for Microsoft Word formats to be used in Writer. It covers the motives for implementing a new filter and the concept used. Further entries will deal with more details of the concept.

Why a new filter?

There are several file formats, that originate from the core of Microsoft Word and are supported by StarOffice/OpenOffice.org:

  • binary Word formats (sw/source/filter/ww8)

  • RTF (sw/source/filter/rtf)

  • WordProcessingML in Microsoft Office 2003 (via XSLT)

For each of these formats there is one separate implementation in OpenOffice 2.0. The result of this quite clear. The feature sets supported are different. If a new feature is to be implemented, there are several filters to do the implementation in. Bugs can appear in each of the filters.

Investigating further reveals that all of the formats mentioned above share common structures, i.e. the structures of the Word core. This fact lead us to the concept described in the following. It shows how we want to solve the problem of implementing common properties several times in the new filter.

The Concept

In principle a filter maps between domains. These domains are the document spaces containing all possible documents of the formats that the filter maps between. In this terminology a filter can also be called a domain mapper.

The core idea in our concept for the new filter is shown in the figure below. Instead of having a domain mapper for each source format, there is an intermediate format that is the input for only one domain mapper.




The domain mapper gets its input from a tokenizer specific for the source format. The tokenizers abstract from the source formats and deliver the documents content in the intermediate format to the domain mapper. Having one domain mapper means implementing mapping of one feature one time and thus reducing maintenance efforts.

The document in StarOffice/OpenOffice.org is generated using a new import API. This way it is possible to uncouple the new filter from the core and deploy it as an UNO component.

The path in the figure marked in blue color shows where we are concentrating our current efforts. The other paths are planned to follow later.

Where Do I Get the New Filter?

If you want to have a look at the current state of the filter you can find information how to retrieve and install the filter component here:

http://wiki.services.openoffice.org/wiki/WriterFilter

How to Help

If you like to participate in our efforts for the new filter, please feel free to contact me at

Henning.Brinkmann@sun.com



tags:

Posted by Henning Brinkmann on 27 Nov 2006  |  PermaLink |  Bookmark to Delicious To Delicious |  Digg this Digg this  |  Comments[1]

GullFOSS