Ryan Ackley

Subscribe to Ryan Ackley: eMailAlertsEmail Alerts
Get Ryan Ackley: homepageHomepage mobileMobile rssRSS facebookFacebook twitterTwitter linkedinLinkedIn


Related Topics: Apache Web Server Journal

Apache Web Server: Article

Unlocking Microsoft Office Documents

An open source alternative

If you've ever written software to be used by business managers, you will no doubt have received requests for interoperability with the Microsoft Office Applications. "Get me the report in Excel; HTML doesn't cut it and I need to run my own analysis on it"; "Can you index the zillion word documents I have so that the whole organization can search on them?"; "I have all this data in Excel; do I have to enter it again on this Web page?".... These are things we commonly hear as application developers, which is not surprising given the ubiquity of MS Office.

Does this mean you're forced to tie your application to Windows to interface with the COM APIs of Excel or Word? Apart from the fact that you don't want your language or platform decision to be constrained by a lack of choice, it's also important to note that these APIs can be unstable because they're automating a desktop application. Because of this, they are unreliable for any server-side deployment. For the Java developer, however, the power of Jakarta POI is close at hand.

POI is a pure Java application library for reading and writing the Microsoft OLE2 Compound Document Format (OLE2CDF) file formats. This format is used by (among others) various MS Office applications. As the name suggests, this is a format for storing multiple documents (or streams) in one file, for example, storing an embedded spreadsheet along with a presentation. Within this structure are stored the records that contain the application-specific data.

POI is structured along these lines. At its base it has a component known as the POIFS or the POI File System, which is the most complete implementation of the OLE2CDF structure in Java. Layered above this are the components to read the Excel record structures (HSSF) or the Word record structures (HWPF).

HSSF
HSSF is the component of POI that allows you to read, write, and manipulate Excel spreadsheets from pure Java applications. It consists of code that understands the Excel record formats, and wraps them up in an easy-to-use API.

How easy does HSSF make reading Excel files? See for yourself!


  InputStream in = new FileInputStream("data.xls"));
  HSSFWorkbook wb     = new HSSFWorkbook(in);
  HSSFSheet sheet = wb.getSheetAt(0);      // the 1st sheet
  HSSFRow row     = sheet.getRow(1);       // get the 2rd row
  HSSFCell cell   = row.getCell((short)1); // the 2nd cell of the 2nd row

The model of an Excel document in HSSF begins with the HSSFWorkbook object. This object provides access to the sheets (by name or number), which in turn provides access to the rows (HSSFRow) in the sheet. Each row provides access to the individual cells (HSSFCell) it contains.

From the cell object you can retrieve data contained in that cell via accessor methods, depending on the type of data. Listing 1 provides an example.

Given this object model, writing is equally simple. Instead of "get"-ing rows and columns, you "create" them and then "set" the values in the cells as in Listing 2. Once again, start with the HSSFWorkbook class, whose default constructor provides a new workbook object; then populate the workbook by creating a sheet in which you create rows. In each row create the cells you need. Finally, populate the cells with the data. As Listing 2 shows, a cell can contain integers, floats, strings, and dates.

Styles
All that is fine, but plain data is usually not sufficient to keep your users happy. HSSF therefore has a whole range of features designed to let you use a variety of styles and formats that Excel supports.

To start applying styles to cells, first create an instance of an HSSFStyle class:

HSSFStyle myStyle = wb.createCellStyle() // wb is an HSSFWorkbook object

The style object will now provide you with methods to set various style parameters, such as foreground and background colors, fonts, borders, and data formats, via conventionally named setters.

Data Formats
A key component of a cell's style is its data format. This specifies, for example, the number of decimal places in a number, or the format of a date. The data format is set using the setDataFormat method of HSSFStyle. This method takes an integer, which is an index to a format, since Excel keeps a list of indexed built-in formats (and user-defined formats are appended to this list and indexed in a similar fashion).

It's easy to get the index, however. For a built-in format, use the static getBuiltinFormat method in the HSSFDataFormat class. Give it the format string and it will return the correct index, the proper index for you. To set a format:

myStyle.setDataFormat(HSSFDataFormat.getBuiltinFormat("d-mmm-yy");

For a user-defined format, first get an instance of HSSFDataFormat from an HSSFWorkbook object to ensure that your format is registered with the workbook:


  HSSFDataFormat df = wb.createDataFormat();
  myStyle.setDataFormat(df.getFormat("dd%MMM%yyyy"));

If you don't want to worry about which formats are user defined (it's documented in the Javadocs for HSSFDataFormat), simply use the nonstatic method and it will take care of this issue internally.

When you have defined the style you want, just set it to the cell:

cell.setCellStyle(myStyle);

Reuse the same style object for cells that are similarly formatted - do not create new style objects for each cell, since Excel has an upper limit on the number of styles that can be referenced in a workbook. For example, you could create one style object for the table headers, one for the body, and one for the footer and use them throughout your spreadsheet.

Formulas
Probably one of the most important features of HSSF is the ability to populate cells with formulas. This allows you to create dynamic spreadsheets and facilitate the user's ability to change the data and perform her own analysis (which is indeed the power of spreadsheets, and the number one reason why you would want to output Excel files).

Formulas are created using the setCellFormula method of an HSSFCell object. The input to this method is a string containing the formula you want at that cell. It should be in the same format that you would type into the edit box in Excel (without a leading "="), thus:

cell.setCellFormula("A1+A2^2");

You could use any built-in VBA function, or even a user-defined function, in the formulas:


  Cell.setCellFormula("average(A1:B1)");
  cell.setCellFormula("mySpecialFunction(A1/A2)");

If you need to provide your users with the ability to copy-paste or drag an Excel formula in the resultant sheet correctly, you might want to use absolute references instead of relative. If formulas with relative cell references (the default, e.g., A1) are copied from one cell and pasted to another, the cell references in the formulas change relative to the destination cell.

cell.setCellFormula("A1/$A$25");

However, if the formula contains references that are absolute, they stay the same irrespective of the destination cell. Absolute references are specified by adding a $ symbol to the reference, viz. $A$1. Note that the row and the column can be individually addressed while specifying absolute references, viz. A$1 vs $A1.

You can also reference other sheets in the same workbook in the formula. HSSF does not yet support the ability to write formulas referencing external workbook files.


  Cell.setCellFormula("SUM(Sheet1!A1-Sheet1!A2)");
  // formula in cell A1 of Sheet2.

Note, however, that the formula results are not calculated by HSSF, which is really a file format reader and writer, not a functional replacement for a spreadsheet application. The formula is merely written into the file in the proper format and evaluated when the file is opened in Excel.

Finally
Among other advanced features, HSSF allows you to create merged cell regions. You can also set headers and footers for sheets, as well as set print areas, to ensure the data prints well. You can create split and freeze panes, set zoom options, or enable sheet protection. Additionally, you can create and manipulate named ranges. Later versions (see sidebar - A Guide to POI Versions) also let you programmatically create drawings in sheets.

However, there are always features of an Excel file that POI does not yet support. In such scenarios, templates are invaluable. The idea is to create an empty Excel spreadsheet populated with the attributes that POI doesn't support. You could, for example, create a chart in the spreadsheet referencing named ranges, or create a pivot table in a certain area. At runtime, in Java code, you could read the workbook in with POI and fill in the cells with data from your application. Now when the user opens the workbook in Excel, it comes loaded with data, charts, and pivot tables. Listing 3 provides an example.

Hopefully this overview of HSSF has convinced you that HSSF has almost all it takes to create professionally produced Excel spreadsheets that'll be a joy to your users, and leave them asking for more.

Word Documents with HWPF
The HWPF (Horrible Word Processing Format) component of POI is a Java library for reading and writing Word documents. It's still in early beta but is relatively stable and it is the only open source Java solution we know of for programmatically accessing and/or creating a Word document.

I am going to give a short introduction to the high-level structure of a Word document. These are basic concepts that can be applied to most styled document formats and they will make later sections of this article easier to digest.

A Word document can be modeled as a tree-like structure. Figure 1 illustrates this. The document has sections, a section has paragraphs, and a paragraph has character runs. Each instance of these is associated with a range of text.

  • A section can be correlated with a chapter in a book. A section contains obscure properties like the page border and the number of columns.
  • A paragraph follows the traditional definition of a paragraph. It contains more familiar properties that most Microsoft Word users know. The justification (left, center, right) and the indent setting are good examples.
  • A character run is a consecutive run of characters that share the same formatting. These contain the most common and visible properties. Some examples are font family, font size, bold, italic, and underline.
This provides you with enough information to use Java to read and manipulate this model. To get started, we have to create an HWPFDocument object from a physical Word file.


1 FileInputStream in =
2   new FileInputStream("C:\\test.doc");
3 HWPFDocument doc =
4   new HWPFDocument(in);

The Section, Paragraph, and CharacterRun classes represent the document tree that I explained earlier. I walk that tree in Listing 4.

First, I get the Range object for the entire document. This is the entry point to the object model. The Range class is an important piece of the HWPF API. It represents an arbitrary range of text in the document, with one to many sections, paragraphs, and character runs. The Section, Paragraph, and CharacterRun classes extend the Range class.

The methods numSections(), numParagraphs(), and numCharacterRuns() and the correlating getters are actually implemented in the Range class. Of course, if you call numSections() on a Paragraph object, it will return one. That would be the parent Section of that Paragraph object.

Another important method in the Range class is text(). This can be used to get the plain text for a particular range. To get the text for a document, use the following code:

String plainText = doc.getRange().text();

Once we have an instance of a Section, Paragraph, or CharacterRun object, we can read its properties by calling its various getters.


//Check the number of columns
//for this section
Section sect = r.getSection(x);
sect.getNumColumns();

//See if a paragraph is set to
//have a page break before it.
Paragraph par = sect.getParagraph(y);
boolean breakBefore = par.pageBreakBefore()

//Get the font name of a
//character run
CharacterRun run =
  par.getCharacterRun(z);
String font = run.getFontName();

These are quick examples. There are dozens of settings and there isn't enough space to cover them all. I encourage you to read the Javadoc to see what is possible.

Tables
Behind the scenes, tables are just a group of paragraphs with certain flags set. HWPF attempts to hide the juicy details but it still needs a little help (see Listing 5).

Listing 5 touches every paragraph in the document, looking for one with the table flag set. When it finds one, it passes it to the getTable method on line 8. Notice on line 12 that it's necessary to increment x so that the paragraphs that were part of the table aren't processed again.

Tables have TableRows, which in turn have TableCells. All these classes extend Range so you can use all the methods that I've already talked about for getting the contents of these entities.

Lists
Unlike tables, lists don't have a beginning and an end, because entries in a list can be inserted anywhere in the document and the list numbering can pick up wherever it left off. The ListEntry class is used to represent an entry, and it extends the Paragraph class. Look at how I get a list entry in the following example:


1 for (int x = 0; x < numPars; x++)
2 {
3   Paragraph par =
4     range.getParagraph(x);
5
6   if (par instanceof ListEntry))
7   {
8     ListEntry entry = (ListEntry)par;
9
10    //do something with the entry...
11  }
12 }

Adding New Content
There may be a time when you want to generate new Word documents or modify an existing document using Java. My first word of advice is to make sure that this is absolutely necessary. In most cases, a nonproprietary file format such as PDF, RTF, or HTML is the better choice. There are free libraries available for all of these. In the cases of RTF and HTML, the standard JDK provides the javax.swing.text package to manipulate the file formats. A rule of thumb for creating Word documents is: Will the eventual recipients of these documents want to edit them? If not, the PDF or HTML format is a better choice. If they do wish to edit them, consider using RTF instead of the Word file format.

The writing functionality of HWPF is somewhat experimental so expect some bugs and limited features. Modifying an existing document or creating a new Word document from scratch starts the same way - simply create a new HWPFDocument as shown in an earlier example. The only difference is that if you want to create one from scratch, you start with a blank document. The POI distribution comes with one called "blank.doc."

To commit any changes to a physical file and see what they do, you must write out the modified document. The following code writes out a Word document that contains any changes made to the original object model.


FileOutputStream docOut =
  new FileOutputStream(
     "C:\\testout.doc");
doc.write(docOut);

To be safe, I wouldn't recommend overwriting the original document. HWPF attempts to keep things that it doesn't directly support in the file, but this doesn't guarantee that they will be there when it writes the file out again.

The Section, Paragraph, and CharacterRun classes define setters that allow the various properties of existing content to be changed. The Range class defines the following methods for adding text and paragraphs to a document.

  • insertBefore(String text): Inserts a string into the document at the beginning of the Range. Assumes the properties of the character run at the beginning of this range.
  • insertAfter(String text): Inserts a string into the document at the end of the Range. Assumes the properties of the character run at the end of this range.
  • insertBefore(String text, Character-Properties props): Inserts a string into the beginning of the Range with the properties given by props.
  • insertAfter(String text, Character-Properties props): Inserts a string into the end of the Range with the properties given by props.
  • InsertBefore (ParagraphProperties props, int styleIndex): Inserts a new empty paragraph at the beginning of this Range. Based on the style at index styleIndex in the stylesheet.
  • InsertAfter (ParagraphProperties props, int styleIndex): Inserts a new empty paragraph at the beginning of this Range. Based on the style at index styleIndex in the stylesheet.
All of the insert methods return the Range that the insertion is now a part of. For example, when inserting a paragraph using insertAfter(ParagraphProperties props, int styleIndex), a Paragraph object is returned. Since Paragraph extends Range, all of the above methods can be used to fill this paragraph with text. The ParagraphProperties and CharacterProperties are similar to the Paragraph and CharacterRun classes. The difference is that classes ending with "Properties" are not associated with a location in a document. There are also SectionProperties and TableProperties.

The methods that insert a paragraph require a style index. Paragraphs and character runs store their settings as deltas from a style stored in the stylesheet. Styles provide a convenient way to maintain a consistent look and feel in a document. They also help a person creating a Word document through a user interface to be more efficient. To a programmer this may not matter. No matter what the style is, whatever properties are set for a particular Paragraph or CharacterRun object will appear in the document. I recommend just using the number 0 for a style index. This will always refer to the "Normal" style in the stylesheet.

Editing Tables
Because of the complexity, the range class does not currently define methods for inserting tables. However, the TableCell class extends Range, so all of the insert methods defined in Range can be used to add content to the individual table cells of an existing Table.

Adding Lists
Adding a list is a little tricky. Unlike most objects in the document, a list is not associated with a range of text. There are paragraphs that are associated with a list and these paragraphs are actual entries in a list. Before an entry can be added to a document, a list must be created. The following code creates a list.


1 HWPFList list = new HWPFList(true,
2   doc.getStyleSheet());
3
4 int listID = doc.registerList(list);

The HWPFList constructor takes two arguments. The first one is a boolean determining whether the list should be bulleted (if the argument is false, the list will be numbered), and the second is the stylesheet of the document to which the list will belong. The register List method that I call on the method on line 4 is defined in HWPFDocument. It returns a unique ID that's needed when adding a list entry to the document.

The Range class defines more insert methods for adding list entries.

  • insertBefore(ParagraphProperties props, int listID, int level, int style-Index)
  • insertAfter(ParagraphProperties props, int listID, int level, int styleIndex)
What is different from the normal paragraph insert is that both of the above methods require the list ID and the level. The level argument refers to the indent level of the list. At this point, the level argument is ignored because HWPF only supports writing simple, one-level lists. Figure 2 shows a screenshot of the Word document created using the code in Listing 6.

Summary
POI has its weaknesses. The biggest by far is the memory consumption in the Excel component (HSSF). The POI team has recognized this problem and is trying to address it in a coming release. The Word component's (HWPF's) biggest problem is that it isn't very mature. Right now it only provides very limited functionality. Even the Excel side of POI could use improvement on its support of some key Excel features, such as charting and images

If POI doesn't cut it, there is a wide selection of commercial libraries for working with Excel, such as SoftArtisans OfficeWriter. SoftArtisans (www.softartisans.com) is the only vendor I could find that also offers a product that can create Word documents in pure Java. OfficeWriter also supports every feature of Word and Excel.

With the new agreement between Sun and Microsoft, we may one day see the opening of the Microsoft file formats. While you wait for this day to come, POI provides a free open source alternative.

References

  • Apache POI: jakarta.apache.org/poi
  • WinCVS: www.wincvs.org
  • SoftArtisans OfficeWriter: officewriter.softartisans.com/officewriter-240.aspx

    SIDEBAR

    A Guide to POI Versions
    As an open source project, POI's development is carried out in a public repository by a group of volunteers. As a result, the code is quite dynamic, and this guide will help you navigate the multiple versions you'll find in the wild. In general, note that releases with beta, dev, or RC attached to their names are flagged as development releases, while releases without these postfixes are flagged as production releases.

    The 1.5.1 version released early 2002 was the preferred production version for a long time. But after a long series of new features, followed by a longer period of bugfixes and stabilization, the 2.0 version was released in January 2004.

    Subsequently, the 2.5 version was released in late February 2004 to incorporate a major new piece of functionality - the ability to create drawings in Excel sheets via what is known as the Escher Layer.

    Meanwhile, development had been ongoing in an experimental branch to enable the reading and writing of Word documents (HWPF). Unfortunately, it's necessary to download this piece of POI directly from CVS and compile it yourself. There are many excellent and free client applications for accessing CVS repositories such as WinCVS and jCVS.

    SIDEBAR 2

    Getting Started
    Getting started with POI couldn't be easier. Download the version you want from www.apache.org/dyn/closer.cgi/jakarta/poi/ as a zip or tar.gz archive. From the archive extract poi-<version>-<date>.jar. Add this file to your classpath and you should be set. POI has an optional dependency on log4j, but that's needed only if you turn on logging (which is disabled by default).

  • More Stories By Ryan Ackley

    Ryan Ackley is a developer on the Apache POI project. He works as a senior programmer for Space Gateway Support at the Kennedy Space Center, FL. He is also a Master’s candidate for computer engineering at the University of Central Florida.

    More Stories By Avik Sengupta

    Avik Sengupta is a committer on the Jakarta POI project, and is chief technology officer at Itellix Software Solutions.

    Comments (3) View Comments

    Share your thoughts on this story.

    Add your comment
    You must be signed in to add a comment. Sign-in | Register

    In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


    Most Recent Comments
    jettimadhu 03/10/10 12:36:00 AM EST

    Hi,
    I want read an images from MS word .doc file and place in a specified position in same format and place.
    is it possible ?

    Please reply to this..........

    c.ponnuchamy 10/20/08 07:35:51 AM EDT

    i dont know how to get POI.jar(which contains HWPF Class ) please help me to download it)

    Piers 08/06/04 06:11:22 PM EDT

    Nice one Ryan & Avik, I''m sure this will generate a great deal more interest in Apache POI