Report Emitter to Word Document
This project aims to provide support for exporting a report to Word document format.
Proposed May 30, 2006
Posted in Wiki format, July 17, 2006 -- Wenfeng Li
- Specification Lead
BIRT 2.0 currently supports PDF and HTML as export formats. This proposal adds a format for exporting to word processors, specifically the Microsoft WordprocessingML format. The Word XML format has been available since Word 2003. An extended XML format for Word 12 has been submitted to a standards body (ECMA) for standardization.
Since the XML format is clearly the future direction for word processors, we propose to use it as the export format for Word instead of the outdated RTF.
The design of the Word emitter tries to achieve the following objectives:
- Capture the report presentation and layout as closely as possible.
- Enable easy editing of the resulting document.
Because BIRT and Word documents use different layout models, some presentation and layout attributes may not be supported in the target format. In those cases, the emitter should try to simulate similar effects through other supported constructs.
A major reason for exporting to a Word document is so that users can edit or copy information from the exported file. The emitter implementation will try to maintain a structure that is easy to edit and copy. In Word documents, contents flow in a main document flow, and any change in the contents causes the contents below to move up or down. This layout model matches the BIRT report flow fairly well.
One obvious mismatch is the use of the grid in BIRT as the main layout manager. Word documents don’t have a direct equivalent. We will simulate a grid using a table, which should produce the same effect. Other issues may arise upon further investigation. For example, HTML is not directly supported in Word documents, but BIRT allows it in text elements. If an element can’t be mapped to an equivalent document construct, we will work around the problem by inserting a rendered image in its place. However, this type of workaround should be minimized and used only if no other solution is found. There will likely be trade-offs when a report element can’t be mapped directly to the document while retaining both the look and feel and the ability to edit further. In those cases, decisions should favor user editing over rigidly capturing the layout and presentation attributes.
Components and Structures
Unlike PDF, Word documents don’t offer precise layout control. The placement and layout of the contents are controlled by the document flow, not by absolute positions. As stated earlier, the emitter will try to map the report layout (grid) to document tables, but leave the actual layout to the word processor. Consequently, the Word emitter doesn’t include a layout component as the PDF emitter does.
A thin file format layer will be added to handle the actual output of the WordprocessingML. The format layer provides a simple API for outputting the XML elements defined by WordprocessingML, such as paragraph, table, and image. It is not aware of the internal report structure and serves as the low-level I/O in the export process.
The emitter handles the conversion from ROM to WordprocessingML. Since the main intended use is producing an editable document, the conversion process should concentrate on capturing the report contents and fitting them nicely into the document flow. Layout explicitly defined by the user, such as page breaks, should be respected as much as possible, while the actual calculation of content placement is left to the word processor.
The format layer API should be as generic as possible, and should allow the emitter to be extended to handle additional output formats in the future. Specifically, the API should give consideration to the Open Document standard, which is a likely candidate for future enhancements.
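As a rough illustration of what such a format layer could look like, the sketch below writes a minimal Word 2003 WordprocessingML document containing plain paragraphs. The class name `WordMlWriter` and its methods are hypothetical and not part of BIRT; only the element names and the namespace come from the Word 2003 XML format. The writer knows nothing about the report model and pushes output straight to the underlying stream.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

/** Hypothetical format layer: knows WordprocessingML, not the report structure. */
final class WordMlWriter {
    // Namespace of Word 2003 WordprocessingML documents.
    static final String NS = "http://schemas.microsoft.com/office/word/2003/wordml";

    private final PrintWriter out;

    WordMlWriter(Writer target) {
        this.out = new PrintWriter(target);
    }

    /** Writes the XML prolog and opens the document body. */
    void startDocument() {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>");
        out.write("<?mso-application progid=\"Word.Document\"?>");
        out.write("<w:wordDocument xmlns:w=\"" + NS + "\"><w:body>");
    }

    /** Writes one paragraph holding a single run of escaped text. */
    void paragraph(String text) {
        out.write("<w:p><w:r><w:t>" + escape(text) + "</w:t></w:r></w:p>");
    }

    /** Closes the body and the document, then flushes the stream. */
    void endDocument() {
        out.write("</w:body></w:wordDocument>");
        out.flush();
    }

    /** Escapes the characters that are not allowed in XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        WordMlWriter writer = new WordMlWriter(buffer);
        writer.startDocument();
        writer.paragraph("Sales < Target & Forecast");
        writer.endDocument();
        System.out.println(buffer);
    }
}
```

Because the class only accepts a `Writer`, the same emitter code could in principle be pointed at a different format layer later, which is the kind of extensibility the proposal asks for.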
ROM to Document Mapping
- Text elements:
All text elements, including label, text, dynamic text, and data, are exported as paragraphs, text boxes, or table cells, depending on the container of the element. All formatting attributes should be supported.
Images are embedded in the output document as bitmaps. The exact format depends on the WordprocessingML specification. Compression should be performed if allowed by the format. If an image is output multiple times, only one copy should be embedded in the document, provided the ROM contains enough information to identify identical images.
Charts are rendered as images and embedded in the document.
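One way to embed a repeated image only once is sketched below. It identifies images by a SHA-256 digest of their bytes, which is a substitute technique chosen for illustration; the proposal itself relies on whatever identifying information the ROM provides. `ImageRegistry` and the generated ids are hypothetical names.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry that hands out one id per distinct image content. */
final class ImageRegistry {
    private final Map<String, String> idsByDigest = new HashMap<>();

    /** Returns the id for these bytes, creating a new id only on first sight. */
    String idFor(byte[] imageBytes) {
        String digest = sha256Hex(imageBytes);
        String id = idsByDigest.get(digest);
        if (id == null) {
            id = "img" + (idsByDigest.size() + 1);
            idsByDigest.put(digest, id);
        }
        return id;
    }

    /** Number of distinct images that would need to be embedded. */
    int embeddedCount() {
        return idsByDigest.size();
    }

    private static String sha256Hex(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }
}
```

The emitter would write the image data when an id is first created and only a reference to that id on later occurrences.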
All container elements, including grid, table, and list, are exported as tables.
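The container mapping can be pictured with a small sketch that turns a grid of cell texts into a WordprocessingML table. `GridToTable` is a hypothetical helper, not BIRT code, and it leaves out table and cell properties such as widths and borders, which a real emitter would most likely need to write as well.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: renders rows of cell text as a w:tbl element. */
final class GridToTable {
    static String toTable(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<w:tbl>");
        for (List<String> row : rows) {
            sb.append("<w:tr>");
            for (String cell : row) {
                // Each cell wraps its text in a paragraph and a run.
                sb.append("<w:tc><w:p><w:r><w:t>")
                  .append(escape(cell))
                  .append("</w:t></w:r></w:p></w:tc>");
            }
            sb.append("</w:tr>");
        }
        return sb.append("</w:tbl>").toString();
    }

    /** Escapes the characters that are not allowed in XML element text. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(toTable(Arrays.asList(
                Arrays.asList("Region", "Revenue"),
                Arrays.asList("East", "120"))));
    }
}
```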
The Word emitter supports only user-specified hard page breaks. User-defined page breaks are converted into document hard page breaks. If a page break is defined on an element inside a grid, the page break needs to be pushed up to the grid level, and the emitter needs to split the grid into multiple tables with a page break in between. The same splitting should be performed for tables where page breaks are defined in the middle of the table.
Due to limitations of the document format, page breaks defined in a nested table or in a table inside a grid may not be supported. For example, if a page break is defined in a grid cell but other cells in the same row span multiple rows, the grid can’t be split at the row level without changing the layout of those cells. The decision is pending further investigation of WordprocessingML.
Hyperlinks to external URLs (both Web URLs and report links) should be supported using the standard hyperlink format. Hyperlinks to bookmarks within the same report should also be supported. Links to other reports are to be processed as in the PDF emitter and always point to the HTML output.
All text contents are exported to the XML document using standard UTF-8 encoding. It’s up to the word processor to handle the actual rendering of the text. This implementation will not support font embedding, so the exported file may require certain fonts to be installed on a system in order to be viewed. Given the extensive font support in operating systems and office suites, this should pose little problem in practice.
Bi-directional text will not be treated differently. The text should already arrive in its natural order, and the characters are output to the document in the same order. It is assumed that the bi-directional handling of the report engine is compatible with the bi-directional processing of the document.
The Word emitter should support all image formats that BIRT supports. Image export should embed the original image file if the image format is supported by WordprocessingML and the original image file is available. Otherwise, the image should be converted to a format supported by WordprocessingML by first loading it into a Java image object and then writing it out in a format supported by Word. The exact image format is to be determined pending further investigation of the specification.
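The splitting described above might look roughly like the following sketch for the simple case of a flat table with no row spans. The `Row` type and its `pageBreakBefore` flag are hypothetical stand-ins for the report model, cell text is assumed to be XML-escaped already, and the nested and row-spanning cases are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: split one table into several at row-level page breaks. */
final class PageBreakSplitter {
    // Paragraph carrying a hard page break, written between the split tables.
    static final String PAGE_BREAK = "<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>";

    /** Stand-in row model: one cell of text plus a "page break before" flag. */
    static final class Row {
        final String text;
        final boolean pageBreakBefore;

        Row(String text, boolean pageBreakBefore) {
            this.text = text;
            this.pageBreakBefore = pageBreakBefore;
        }
    }

    /** Groups rows so that each flagged row starts a new group. */
    static List<List<Row>> split(List<Row> rows) {
        List<List<Row>> groups = new ArrayList<>();
        List<Row> current = new ArrayList<>();
        for (Row row : rows) {
            // A flag on the very first row does not create an empty leading group.
            if (row.pageBreakBefore && !current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
            }
            current.add(row);
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    /** Renders each group as its own table, separated by hard page breaks. */
    static String render(List<Row> rows) {
        StringBuilder sb = new StringBuilder();
        List<List<Row>> groups = split(rows);
        for (int i = 0; i < groups.size(); i++) {
            if (i > 0) {
                sb.append(PAGE_BREAK);
            }
            sb.append("<w:tbl>");
            for (Row row : groups.get(i)) {
                sb.append("<w:tr><w:tc><w:p><w:r><w:t>").append(row.text)
                  .append("</w:t></w:r></w:p></w:tc></w:tr>");
            }
            sb.append("</w:tbl>");
        }
        return sb.toString();
    }
}
```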
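For illustration, the two link types could be written along the following lines. The sketch assumes the Word 2003 `w:hlink` element with its `w:dest` attribute for external targets and `w:bookmark` attribute for targets inside the document; this should be checked against the WordprocessingML schema before use. `HyperlinkMarkup` is a hypothetical helper name.

```java
/** Hypothetical helper for hyperlink markup (element names assumed from Word 2003 XML). */
final class HyperlinkMarkup {
    /** Link to an external URL. */
    static String external(String url, String label) {
        return "<w:hlink w:dest=\"" + escape(url) + "\">" + run(label) + "</w:hlink>";
    }

    /** Link to a bookmark inside the same document. */
    static String internal(String bookmark, String label) {
        return "<w:hlink w:bookmark=\"" + escape(bookmark) + "\">" + run(label) + "</w:hlink>";
    }

    private static String run(String label) {
        return "<w:r><w:t>" + escape(label) + "</w:t></w:r>";
    }

    /** Escapes characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(external("http://example.org/report?id=1&page=2", "Open report"));
        System.out.println(internal("summary", "Jump to summary"));
    }
}
```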
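A minimal sketch of the encoding step follows, assuming the output goes to an `OutputStream` and that the text has already been XML-escaped. Binding the writer to UTF-8 explicitly avoids depending on the platform default charset. `Utf8TextOutput` is a hypothetical name.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: writes one text run to a stream as UTF-8. */
final class Utf8TextOutput {
    static void writeRun(OutputStream target, String escapedText) {
        // The charset is fixed here rather than taken from the platform default.
        Writer out = new OutputStreamWriter(target, StandardCharsets.UTF_8);
        try {
            out.write("<w:r><w:t>");
            out.write(escapedText);
            out.write("</w:t></w:r>");
            // Flush but do not close: the caller owns the target stream.
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```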
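The conversion path through a Java image object could be sketched with the standard `javax.imageio` API as shown below. PNG is used purely as an assumed target, since the proposal leaves the exact format open, and `ImageConverter` is a hypothetical name.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

/** Hypothetical converter: decode any readable image, re-encode as PNG. */
final class ImageConverter {
    /** Loads the bytes into a Java image object and writes it back out as PNG. */
    static byte[] toPng(byte[] original) {
        try {
            BufferedImage image = ImageIO.read(new ByteArrayInputStream(original));
            if (image == null) {
                // ImageIO.read returns null when no registered reader understands the data.
                throw new IllegalArgumentException("Unsupported image data");
            }
            return encode(image, "png");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Encodes an in-memory image in the given ImageIO format name. */
    static byte[] encode(BufferedImage image, String format) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(image, format, out)) {
                throw new IllegalArgumentException("No writer for format " + format);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This sketch buffers a single image in memory for simplicity. An emitter that streams its output would more likely write each converted image directly to the document stream.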
Table of Contents
The TOC in a BIRT report should be translated to a document table of contents. Because the actual layout of the document (pagination) is left to the word processor, the emitter does not have accurate page numbers at export time. As a result, the generated TOC may not contain correct page numbers. Users may need to update the TOC manually to obtain the correct page numbers.
A possible workaround is to add an active tag or script that updates the TOC when the document is opened. This needs to take into consideration whether the mechanism is standard and supported by all versions of Word. This option is regarded as optional and should be pursued after the main functionality is in place.
Headers and footers are exported to document headers and footers. Floating footers require pagination in the emitter and are not supported. All floating footers are treated as regular footers during export. Page numbering in headers and footers is converted into the corresponding tags. As with the table of contents, the page number and page total are not known to the emitter, so the exported values may not be correct. A similar technique for automatically updating the information upon opening the document should be considered.
Since the emitter will not perform actual layout of the report, performance should not be an issue. The memory requirement for the emitter should be small. Contents should be streamed and not held in memory.
There are two options for integrating the Word emitter into the IDE. Currently PDF and HTML are integrated into the IDE as preview options. An export button is also present on the previewer for exporting the previewed report. We feel Word export fits better as an export option than as a preview. To export to Word, a Word export option can be added to the export button.
Another option is to add Word export to the Eclipse export dialog as an export wizard. However, the Eclipse export dialog appears to be intended for project-level operations, and adding Word export there would make the terminology inconsistent. This may require more consultation and consideration of the overall framework and integration philosophy.
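One way the page-number tags might be written is as Word fields whose values the word processor computes during its own layout. The sketch below assumes the `w:fldSimple` element with a `w:instr` attribute, using the Word field codes `PAGE` and `NUMPAGES`; the element name should be confirmed against the WordprocessingML schema. The helper name and the placeholder value "1" are hypothetical.

```java
/** Hypothetical helper: emits page-number fields for headers and footers. */
final class FieldMarkup {
    /** A simple field with placeholder text that the word processor can replace. */
    static String simpleField(String instruction, String placeholder) {
        return "<w:fldSimple w:instr=\" " + escape(instruction) + " \">"
                + "<w:r><w:t>" + escape(placeholder) + "</w:t></w:r>"
                + "</w:fldSimple>";
    }

    /** Current page number; the real value is not known at export time. */
    static String pageNumber() {
        return simpleField("PAGE", "1");
    }

    /** Total page count; also only known once the word processor paginates. */
    static String pageTotal() {
        return simpleField("NUMPAGES", "1");
    }

    /** Escapes characters that are unsafe in XML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println("Page " + pageNumber() + " of " + pageTotal());
    }
}
```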
A new option for exporting a report to Word format should be added to the servlet request API.
Related Bugzilla Entry
The proposal document has been attached to the Bugzilla request. Please post any comments and feedback to this Bugzilla entry.