Welcome to MSDN Blogs Sign in | Join | Help

Brian Jones: Open XML Formats

I'm Brian Jones, a program manager in Office. I've been working on the XML functionality and file formats in Office for about 5 years now. In this blog, I'll mainly focus on XML in Office and the Open XML File Formats coming in the 2007 Microsoft Office system.
Week off

As I just mentioned in a comment I left in last week's post, I'm actually attempting to take a bit of a break this week. We actually wrapped up the development work on Office 2007 last Friday. It's been a huge release, and it's really exciting to see it finally go out the door. I've been working on these file formats for years now, and it's going to be fun to see this release wrap up. I need a vacation :-)

I didn't want to be totally lame though and not blog at all, so here are a few interesting things I wanted to point out:

  • Office 2007 released to manufacturing - it's been a lot of hard work, but it's been worth it.
  • Save Open XML from Older versions of Office - since it's just a web release, you can already get the final version of the free updates that allows older versions of Office to open and save in the Office Open XML format (before Office 2007 is available). This latest release allows you to save using the final 1.5 version of the Ecma working draft (which will be up for a final approval vote by the Ecma General Assembly in about a month).
  • Latest version of the ODF to Open XML translator available - The team working on this open source project has announced an updated release of the translator is now available. It sounds like they are pretty much done with the ODF to Open XML conversion for wordprocessing documents, and they even have a prototype going the other way (Open XML to ODF). Another really funny thing they found when testing this tool was that while Google claims ODF support in google docs, they actually are supporting the old star office XML format, but putting the ODF extension on it.

-Brian

Posted Tuesday, November 07, 2006 2:08 PM by BrianJones | 0 Comments

Novell and Microsoft teaming up on document interoperability

I'm sure by now most folks have read the news on the collaboration agreement between Novell and Microsoft. If not, you can read more up on the Microsoft interoperability site: http://www.microsoft.com/interop (Novell's press release is here)

There are a number of really cool pieces to this (Jason Matusow blogged on this yesterday), and one of those directly relates to the Office Open XML file formats. With this announcement Novell has said they will do the development work to allow OpenOffice to support the Office Open XML formats. This plug-in will be directly distributed with their edition of OpenOffice, but it will also be provided back to the OpenOffice.org organization so that everyone can leverage it. 

Another great piece is that they are going to start participating in the Open XML Translator project. It's an open source project we helped start back in the summer that will translate from Open XML to ODF and from ODF to Open XML. It's a really sweet project because it could be plugged in just about anywhere. You can read more about it on the blog that the developers of the project set up (http://odf-converter.sourceforge.net/blog/index.php).

Here is some more information from the FAQ up on Novell's site (http://www.novell.com/linux/microsoft/faq.html):

Document Format Compatibility. Microsoft and Novell have been focusing on ways to improve interoperability between office productivity applications. The two companies will now work together on ways for OpenOffice and Microsoft Office users to best share documents and both will take steps to make translators available to improve interoperability between Open XML and OpenDocument Formats.

As you know, Novell has been working with us the past year on the Ecma standardization of the Office Open XML file formats. Jody Goldberg, who works on both Gnumeric as well as Open Office was a huge help with SpreadsheetML. One of the key things that Novell was focused on was ensuring that the formats were fully documented and interoperable so that they could support them as well.

I also really liked reading Michael Meeks' blog on this subject. Michael is a distinguished engineer from Novell who works on OpenOffice (and he's Jody's manager):

  • Inevitably people will have some really good questions here, and to save my fingers I thought I'd point out a few points.
    • Why help Microsoft with OpenXML interop. ? OpenXML sucks, OpenDocument Rocks !?. So several thoughts:
      • This should not be a surprise - Jody Goldberg (on my team) has been working hard for months with Microsoft and others on the ECMA process. At one stage there around 1/2 the open 'issues' wrt. improving disclosure (and hence the spec.) came from Jody. I for one am proud of the job that he did there, an (ongoing) investment that will yield better interoperability for years to come.
      • As I have said for many months now, focusing on an 'Open-Standard' of ~700 pages written by a small team over a short period, is to miss the staggering value that is found in Free software. OpenOffice (as anyone who tried to start it recently knows) contains millions of lines of code, and a staggering investment of thousands of man years of sweat, tears (and perhaps blood). It's localized to umpteen languages, has deep help, scripting, accessibility, interoperability; it's just an immensely feature rich and powerful product.
      • To re-emphasise this, the value in OpenOffice.org is not what file format it supports (eg. we want to add good Lotus Word Pro support) but that it is truly Free software, that gives people critical Freedoms. An open format is anyhow implicit in the native file format of any openly developed Free software project.
      • Telling people about open standards, instead of Free Software is easy - 'normal' people generate data, not software so they understand, but it sells them radically short. In my view better interoperability (with any and all formats) strengthens Free Software, quite without the obvious pragmatic benefits to users & customers.
    • Why do business with these scum ?
      • It's true there is a widespread perception of unfair business practice from Microsoft out there, but my experience of working in the ECMA process with the developers, has been of meeting a (to my mind) mis-directed, but equally passionate world-view based around the love of their technology.
      • Broadly, I think it's fair to say there is a certain kind of person that loves to solve complex, technical problems, and I like that kind of person. It's also interesting to note that the average Microsoft (from my small sample) political viewpoint is -way- to the left of the average Novell Free software developer (perhaps a statistical aberration but ...). So, in a nutshell, they're good guys, if mis-directed. The great news is that we can help change that direction and get these guys addicted to the Free Software model.
      • One couple it was fun to meet, both on the Office team, obviously in love, confided in me that they had delayed their marriage to meet the Office 12 schedule: is that dedication ? Let's hope Wedding 2007 will ship on time; but imagine if we can help focus these guys on improving Linux <-> Windows interoperability, and in time Free software for it's own sake.
    • What does it mean for OpenOffice ? - my hope is over the long haul: better interop, more bodies hacking on OO.o, wider penetration of (Novell's) OpenOffice into the enterprise, and more individuals able to boldly hack on Free software.
    • What does it mean for Hackers ? - of course, I'm pleased that our team got such a great formal IPR covenant for individual developers from Microsoft. For sceptics that think this is a pure gesture, it's always surprising to me how a few key people seem to pop up again and again in Free software, and not everyone has the 7 year stamina that can be required, the RIAA demonstrates the danger well.
    • What does it mean for Novell ? - I'm pleased that it seems Microsoft will be distributing lots of SLES coupons, the more the merrier. Of course Nat and Miguel who helped setup the deal have a clearer view.
    • What is this Translator ? - it's the early stages of a open-source project to make a standalone bi-directional Open XML to OpenDocument converter. See SourceForge: odf-converter. What is important to me is not the set of design choices here (eg. a standalone XML to XML converter, though that may be useful for other Free software projects, or it's capabilities: a sub-set of Word only so far), but the end-goal of getting substantially better MS Office interop. (with OpenXML) into OO.o.

It's really cool to see that there will be a number of office applications (Corell, OpenOffice, older versions of MS Office) that will have support for these formats. I've personally been even more excited about the smaller 3rd party non-"office-type" applications that can also now get involved in consuming and generating rich office documents. It continues to raise the value of office documents, as they are no longer just a black box, but instead every office document can serve as a data source. I'm like the possibilities for the developer community that keeps growing here. There is obviously a ton of valuable information that's going to be made available via solution providers. The fact that to both consuming and generating documents has become so much easier is huge.

The translator project is particularly interesting though as it makes it easier for folks to choose the format they want to work with. The Office Open XML formats clearly have customer needs that they were designed to solve; and the OpenDocument format had customer needs that those folks were trying to solve.

-Brian

P.S. Michael,
Tristan and Krista are still on track for the wedding (and it will actually be Wedding 2006, not 2007 :-)

Posted Friday, November 03, 2006 2:32 AM by BrianJones | 8 Comments

Simple SpreadsheetML file Part 1 of 3

I posted a bunch of "Intro to SpreadsheetML" posts about a year or so ago, but those were all based on the Office XP spreadsheetML format. I think an updated series based on the Open XML standard is long overdue. I'll start off just building a simple table, and in future posts show more about formatting, formulas, and maybe even some charts.

Today, we'll start by creating a simple table. Then we'll add a little bit of number formatting and some formulas. In the end, we'll have the following table (where the tax & total columns are automatically calculated based on the Sub Total column):

Sub Total

Tax

Total

$ 14.95

$ 1.20

$ 16.15

$ 19.95

$ 1.60

$ 21.55

$ 4.95

$ 0.40

$ 5.35

We'll take this in 3 separate blog posts:

  • Part 1 - Create the simple table without formatting or calculations
  • Part 2 - Add functions to calculate "Tax" and "Total"
  • Part 3 - Add formatting so the data shows up as currency

Part 1 - Simple Table

Since we won't do the formatting or formulas initially, our table will look like this:

Sub Total

Tax

Total

14.95

   

19.95

   

4.95

   

As I discussed in my "simple wordprocessingML document" post, the Office Open XML format is comprised of a number of XML files within a ZIP package. The files follow a simple set of conventions called the open packaging conventions (described in Part 2 of the standard). You need to declare the content types of the parts, as well as tell the consuming application where it should start (via the package relationship).

Unlike the WordprocessingML document we created though, a SpreadsheetML file has a bit more structure to it (the same is true for presentations). A SpreadsheetML file is actually a workbook that can contain multiple worksheets. so even your most simple workbooks will have at least 5 files within the ZIP package. So for this example, let's start by creating a folder somewhere and in that folder create the following files:

  • workbook.xml
  • worksheet.xml
  • [Content_Types].xml
  • _rels/.rels
  • _rels/workbook.xml.rels

workbook.xml

The workbook is essentially the container for the various worksheets. The workbook is where you can reference the styles part, shared string tables, and any other pieces of information that apply to the entire Spreadsheet file. In this example, since we're just creating a super basic spreadsheet, the workbook will be very simple:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
workbook xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <sheets>
    <
sheet name="Brian" sheetId="1" r:id="rId1"/>
  </
sheets>
</workbook>

The only interesting thing we did here was to create the sheet tag, which then references out worksheet via the r:id attribute. Remember that almost every time you reference another part or even something outside of the file like a hyperlink or a linked image, you will use a relationship. The next thing we need to do is actually create that relationship in the workbook.xml part's relationship file.

_rels/workbook.xml.rels (part 1)

This is pretty basic. We just need to create a relationship that has an id of rId1 so that it will match the reference from the workbook.xml part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/worksheet" Target="worksheet.xml"/>
</
Relationships>

Notice that a relationship has three main attributes. It has an Id attribute who's use will be more obvious in a bit. The Target attribute tells you where to go, and the path is relative to the parent directory of the "_rels" folder that relationship file is in (in this case that's the root directory). The Type attribute describes what kind of relationship it is (ie what kind of stuff is it pointing at).

In this part we have one relationship who's type is "worksheet", and the target points to our worksheet.xml part. Now we need to actually put some content in the worksheet.xml part

worksheet.xml

The worksheet.xml part is going to be pretty simple. The first row in the sheet will have the column titles ("Sub Total", "Tax", and "Total"). The next 3 rows will only have data in the first column as we won't create the calculation functions until later in this example.

The worksheet.xml part should look something like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships">
  <
sheetData>
    <
row>
      <
c t="inlineStr">
        <
is>
          <
t>Sub Total</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Tax</t>
        </
is>
      </
c>
      <
c t="inlineStr">
        <
is>
          <
t>Total</t>
        </
is>
      </
c>
    </
row>
    <
row>
      <
c>
        <
v>14.95</v>
      </
c>
   
</row>
    <
row>
      <
c>
        <
v>19.95</v>
      </
c>
    </
row>
    <row>
      <
c>
        <
v>4.95</v>
      </
c>
    </
row>
  </
sheetData>
</
worksheet>

If you've played around with the SpreadsheetML that Excel outputs, you'll notice that I've handled strings a bit differently. For faster save and load times, Excel actually uses the shared string table that is an optional feature of the Office Open XML formats. It allows you to write a string just once, and then reference that string from within the grid. So, instead of saying "Sub Total" in that first cell, Excel would have an id that references the entry in the string table for "Sub Total". This is just an optional feature though, and for simplicities sake I've just put my strings inline (using the <is> tag). There isn't really any perf gain if the string is only used once.

Also, you'll notice that unlike a typical table format (like HTML, CALS, etc.) the XML above is representing a spreadsheet. It's a subtle difference when working with simple examples like this, but becomes more obvious as you move into more complex spreadsheets. One noticeable difference right away though is that we don't write any elements down for the empty cells B2:C4. If there isn't any data in a cell, then you just don't write anything. This is a bit of a different model from table formats that are more presentation based.

_rels/.rels

How does a consuming application know where it should start when opening an OpenXML file? The first place you always look is the package relationships file. The package relationship file will always be located in the "_rels" directory, and it's always called ".rels". We need to create an XML file that tells the consumer that "workbook.xml" is the first place you should go, and that this type of document is an Office Open XML document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <
Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="workbook.xml"/>
</
Relationships>

[Content_Types].xml

OK, so we've now created the main workbook.xml part and the worksheet.xml part, as well as created a relationship between the two.

Every Office Open XML file must declare the content types used in the ZIP package. That is done with the [Content_Types].xml file. We currently have two parts in this document that we need to declare content types for. The first is the document.xml part; the second is the _rels/.rels part. So, the content types file should look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<
Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <
Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <
Override PartName="/workbook.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"/>
  <
Override PartName="/worksheet.xml" ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.worksheet+xml"/>
</
Types>

Here we are saying that anything ending with the .rels extension is of type Package Relationship, and we also declare that the part workbook.xml is of type workbook and worksheet.xml is of type worksheet.

Create Version 1 of our simple SpreadsheetML file

OK, we should now have five files. Three of the files are in the root directory (workbook.xml; worksheet.xml & [Content_Types].xml); and in the "_rels" directory we have the ".rels" file and the "workbook.xml.rels" file. Select the two files and the "_rels" directory and ZIP them up. Make sure that when you zip them up, the two files and the _rels directory are all at the root level.

Open this file in Excel, and you now have a simple file.

*Beta 2 TR Note*

Note that the namespaces are different between B2TR of Excel, and the final version of the Ecma standard. In this post, I have shown what the final version of the Ecma standard would look like. The RTM build of Office will use this same format. If you are on B2TR though, then you'll need to tweak the following namespace in the XML files: http://schemas.openxmlformats.org/spreadsheetml/2006/main

Instead of using that namespace, you'll need to use this namespace http://schemas.openxmlformats.org/spreadsheetml/2006/7/main to get it working in B2TR. The two parts you'll need to update with this namespace are wordsheet.xml and workbook.xml. Everything else should work fine

Well, that was the first piece. In the next post, we'll add functions to the spreadsheet.

-Brian

Posted Thursday, November 02, 2006 10:50 AM by BrianJones | 3 Comments

Filed under: ,

Friday thoughts (Oct 27, 2006)

I'd been meaning to post a write-up on how to create a simple SpreadsheetML document from scratch, but just haven't had the time this week to pull it all together. Hopefully I'll get that out early next week. I had already done a similar post for WordprocessingML (both for Beta 1 as well as RTM). Here are a couple things I wanted to point out for the week:

  1. Arccast interview on Office Open XML - Doug Mahugh and I did a live webcast last month with Ron Jacobs. Ron now has both Part 1 and Part 2 of the interview available up on Channel 9.
  2. Upcoming SpreadsheetML Generator - Stephane Rodriguez (who also wrote the Open XML diffing tool) is getting really close to releasing the latest version of his xlsgen tool; and it will include SpreadsheetML support (http://www.arstdesign.com/BBS/BulletinBoard.php?qs_id=1661). I think he's planning on still going with the B2TR version of SpreadsheetML and won't update to the RTM version until after Office 2007 ships. There were still a couple changes that tool place between B2TR and RTM for spreadsheetML that brought it fully inline with the Ecma standard, but that resulted in B2TR not having the ability to open RTM spreadsheetML files (similar to what all three applications experienced between B2 and B2TR).
  3. Apose.Words supports WordprocessingML - I saw this blog post the other day that mentions that Apose.Words now supports exporting as WordprocessingML.
  4. Document your SQL DB using WordprocessingML - From this blog post: "Data Dictionary Creator (DDC) is a simple application which helps you document SQL Server databases. It stores all the information in Extended Properties, so it's easier to keep the documentation in sync with the database as it changes… DDC exports to WordML, Excel, HTML, and XML."
  5. Generate Wordprocessing Documents from your SAP Web Application Server - This is a cool intro article that shows how you can leverage WordprocessingML to generate rich documents directly from your SAP server. It would be really interesting to see some examples of leveraging the custom defined schema support and content controls in Word 2007 to not only populate the documents with SAP data, but to also mine that information back out of the document if the user has edited it.
  6. Leverage SpreadsheetML to build rich reports - I actually don't know anything about this product :-), but I randomly came across it and noticed that it allows you to generate spreadsheets using the original SpreadsheetML format we started working on over 8 years ago (and shipped with Office XP): "Perfect table creation – NEW SPREADSHEETML The new Microsoft format SpreadsheetML is supported. Based on XML, it generates richer editing and formatting of Excel files optimized for Windows Office 2003. Your tables are perfectly reproduced in Excel 2003, retaining the text and the colored background." I always love seeing people leveraging the existing technologies. The new SpreadsheetML format will give them a lot more power, as the old one didn't support Excel's full feature set.
  7. WordML import and export on the Mac - I think that this tool is built on top of the TextEdit functionality built into the Mac (which supports WordML itself). Not sure if they actually do anything additional in terms of the WordML support.

I hope everyone has a great weekend.

-Brian

Posted Friday, October 27, 2006 2:29 PM by BrianJones | 9 Comments

Performance of an XML file format for spreadsheets

I've blogged in the past about some of the things we did with the SpreadsheetML format to help improve the performance of opening and saving those files. While a file format most likely won't affect the performance of an application once the file is loaded, it can significantly impact the performance of the actual load or save. The majority of a wordprocessing file is usually the actual text content and anything you can do to avoid making the storage of that content complex the better. The majority of a presentation is usually the rich media and styling information and it's important to store those in an efficient manner. The biggest challenge though hands down in terms of file format performance though is a spreadsheet.

It's very easy to get Spreadsheets comprised of millions of cells. In older versions of Excel, it used to be that the maximum number of columns allowed in a worksheet was 256 and the maximum number of rows was 64,000. That meant that a single worksheet in a workbook could consist of over 16 million cells. An of course there can me many worksheets in a workbook.

There were a number of customers who were hitting this limit and wanted us to increase the number of rows and columns allowed in a worksheet (this was actually one of Excel's top requests). This had been a request for awhile, but it was going to be extremely difficult to increase this limit in the old binary formats so we didn't do it. The move to the new Open XML formats though allowed us to increase the limits, so now in Excel 2007, you can have up to 16,000 columns and 1 million rows. That means you could have a worksheet with 16 billion cells. I can't imaging anyone hitting that limit, but to be honest, I couldn't believe folks were hitting the older limits.

So, while that's a look at the extreme end of things, it's not too uncommon to have very large spreadsheets, and a workbook with a million cells of data is definitely something we see a lot of.

Now, while in a Wordprocessing document you may potentially have a million words (which would be about 2000 pages), it's not as likely. More importantly though, the structure of the XML doesn't have as big of an impact. In wordprocessingML, you don't have every word contained in it's own XML element. So you while may have a million words, your XML element count would probably only be in the tens of thousands.

In a spreadsheet though, you have each cell represented by at least one, if not more XML elements which means that a spreadsheet of any reasonable size can easily have millions of XML elements and attributes. This can have significant implications on the load and save performance of the files.

A couple examples:

There are a number of things we looked into doing to help improve the performance of these files both using current technology as well as thinking about future approaches. Some of the approaches we took that are easiest to understand are:

  1. Reduce the need to fully parse repeating data - In order to cut back on parsing large strings repeated multiple times, or formulas that are repeated down an entire column, spreadsheetML does a lot of sharing. I already talked about the huge benefits we can get from the shared formulas work in a post last spring. The shared string table can give similar results.
  2. Tag Size - Contrary to this post from IBM's Rob Wier, the size of XML elements actually does directly affect performance times. The more text you have to parse, the longer it will take. Rob I hope you didn't misunderstand some of my earlier points on this though. I don't mean in any way to give the impression that tag size is the biggest factor in terms of file format performance because it's not. There are a number of other architectural issues that play a much bigger role. That said, the size of the tags does have an impact (as I've blogged about before).

    The simplest way to see the impact for yourself is to just take an application like OpenOffice, and get a decent sized spreadsheet (I took a file with just one worksheet, that was 15 columns by 15,000 rows). Save it in ODF and make 2 copies. In one of the copies open content.xml and change some of the namespace prefixes to make them much longer. I aimed to make it so that the resulting content.xml file was about 3 times larger than the original since this helps replicate the size difference you would see if we changed SpreadsheetML so that it's tag length was longer and more descriptive like ODF (it would probably be more than that, but I found that making content.xml 3 times larger was more than enough to get a noticable slowdown in load times). I used namespace prefix approach because it's the easiest way to replicate the effect of a larger element name without changing the actual element names (since that would prevent the file from being readable by OpenOffice). In my test, by essentially tripling the size of the element names, I found that the file went from loading in about 9 seconds to taking about 11 seconds.

  3. Splitting the XML into multiple parts - In SpreadsheetML we break the XML content out into multiple parts in the ZIP (each worksheet, the string table, pivot table data, etc.). This can help a ton with on demand loading, or even loading multiple pieces at once. For example, it means that if you had a multi-proc machine, you could load each worksheet on a separate thread. It also means that you could decide to only load on sheet and you wouldn't have to parse through all the XML from the other sheets.
  4. Relationships stored outside of content - By storing all the relationships from one part to another in separate (and much smaller) files, it makes it really easy to see what other parts you should load when you are loading a particular part of the file. If it weren't for the relationships, you'd actually have to parse through the content of the XML to determine what other resources that part used. For example, if you wanted to just load one slide in a presentation, you can also see what images are used on that slide before you start parsing the XML for the slide.

There are clearly a large number of factors that play into the performance of a particular format. As you can see, it primarily comes into affect when loading or saving a file. If you do any type of delay loading though or incremental saves, it can also start to have an impact on your editing performance though. The key issues I've focused on primarily affect the general load and save times. The examples above are just a handful of issues we looked at when designing the formats.

-Brian

Posted Thursday, October 26, 2006 3:24 AM by BrianJones | 14 Comments

SpreadsheetML Dates

Recently a couple people have questioned the decision to leave the legacy Excel date behavior in the Open XML formats. This was primarily triggered by a post from IBM's Rob Wier. While those folks have referred to it as an Excel bug, it's actually something that was purposely designed into Excel since the beginning. Whenever you work on a feature, you eventually get to a point where you have to make difficult decisions and there are positives and negatives on each side. Let's go though a couple points here around this date issue: (1) how the dates are stored; (2) history of the bug; (3) backwards compatibility concerns; (4) drawbacks.

How dates are stored

Let's first look at the basic issue of how dates are stored in file formats.

When it comes to the storage of dates, there are a number of different approaches you could take. The key issue to first decide is whether you want to store it in a presentation friendly format, or one designed for easy storage and processing. For example, you could just store it as it's displayed to the user (ie "10/23/2006" or "October 23, 2006"), or maybe using ISO 8601. This could actually be a fine approach for certain types of formats (like a wordprocessing application, or maybe for metadata), but in the case of a spreadsheet that's probably not the best way to go. In order to quickly consume and create the files, you should probably choose one simple consistent storage method as you may be dealing with hundreds of thousands of dates in one file and you don't want to have to deal with any complicated parsing.

In the case of spreadsheetML, where you have the possibility of hundreds of thousands (if not millions) of dates existing in a single workbook, you need to think carefully about how those dates are going to be used. You also need to think about the fact that there may be formulas that take those dates into account. For example, you might have a table like this:

Project Number

Start Date

End Date

Daily Cost

Total Cost

1

9/13/2006

9/27/2006

$5,000.00

$70,000

2

9/20/2006

10/3/2006

$1,000.00

$13,000

The "total cost" column will have a function that is essentially:

=([end date] - [start date]) * [daily cost]

The simplest way to store these dates (both internally and in the formats) is to just pick a specific date as the base date and then every other date is just an index from there. This is similar to the Julian date system. So, if you decided that 9/13/2006 was the base date, then the data stored in the table above would be:

Project Number

Start Date

End Date

Daily Cost

Total Cost

1

1

15

$5,000.00

$70,000

2

8

21

$1,000.00

$13,000

This makes the storage and parsing of the dates a lot easier, and it also makes any formulas that work with dates easy to work with. The dates are always just stored as an index, and then it's just a matter of formatting to determine how the date is presented to the user. So the following dates (9/13/2006; Sept. 13 2006; Wednesday, September 13, 2006) are all the exact same values, they are just displayed to the user differently. This approach also has the advantage of working with other calendar systems since the storage of the date doesn't account for months, years, or days of the week, but is instead just a counter of how many days from a given date have passed. Then any unit of measurement smaller than a day (hours, minutes, seconds, milliseconds, etc.) are just a fraction. So 1 second would be 1.157e-5

History of the date "bug"

It's because of this "index" approach that the bug in question presents itself. Excel uses January 1, 1900 as the base date, meaning January 1, 1900 is stored as "1". The problem is that 1900 was not a leap year because while it passes the first test of being divisible by 4, it doesn't pass the second test which is that it must a multiple of 100 that is not divisible by 400. 2000 actually was a leap year, but 2100 will not be. Rob Wier's blog post has some great information on the history of this behavior in our modern calendar system.

Unfortunately this bug occurs because, February 29, 1900 is actually treated as a date when the indexes are calculated. This means that every date after February 29, 1900 is off by one (or you could also say that the base date was supposed to be 12/31/1899 and every day before March 1, 1900 is off by one). This is only the case for the year 1900 though. 2100 and on are treated properly. Why did this happen in the first place? It's actually covered pretty well in the following Microsoft KB article:

When Lotus 1-2-3 was first released, the program assumed that the year 1900 was a leap year, even though it actually was not a leap year. This made it easier for the program to handle leap years and caused no harm to almost all date calculations in Lotus 1-2-3.

When Microsoft Multiplan and Microsoft Excel were released, they also assumed that 1900 was a leap year. This assumption allowed Microsoft Multiplan and Microsoft Excel to use the same serial date system used by Lotus 1-2-3 and provide greater compatibility with Lotus 1-2-3. Treating 1900 as a leap year also made it easier for users to move worksheets from one program to the other.

Although it is technically possible to correct this behavior so that current versions of Microsoft Excel do not assume that 1900 is a leap year, the disadvantages of doing so outweigh the advantages.

If this behavior were to be corrected, many problems would arise, including the following:

  • Almost all dates in current Microsoft Excel worksheets and other documents would be decreased by one day. Correcting this shift would take considerable time and effort, especially in formulas that use dates.
  • Some functions, such as the WEEKDAY function, would return different values; this might cause formulas in worksheets to work incorrectly.
  • Correcting this behavior would break serial date compatibility between Microsoft Excel and other programs that use dates.

If the behavior remains uncorrected, only one problem occurs:

  • The WEEKDAY function returns incorrect values for dates before March 1, 1900. Because most users do not use dates before March 1, 1900, this problem is rare.

NOTE: Microsoft Excel correctly handles all other leap years, including century years that are not leap years (for example, 2100). Only the year 1900 is incorrectly handled.

Backward Compatibility

While it's no longer as important to maintain compatibility with Lotus 1-2-3, it is important to maintain compatibility with the billions of existing spreadsheet files out there. Barclay's Capital was one of the members of the Ecma TC45, and they helped make it crystal clear how important this backwards compatibility was. Imagine how upset our customers would be if we changed the values returned from functions like =WEEKDAY.

I think that one of the reasons some folks think this is just a "stupid decision" made by TC45 is that they aren't looking at all the places where the behavior of a spreadsheet could be affected. Remember, we don't just want the date values to be interoperable, but spreadsheet functions have to be as well. We spent a lot of time documenting how every single spreadsheet function would work so any application can come along and properly consume and generate SpreadsheetML files.

If we decided to fix this bug and shift each date value down by one, how many spreadsheet formulas out there would we break? Here's a really simple example, where the following function that had worked in previous versions would no longer work:

=IF(TODAY()=39013, "Due Today!", "Not Due Today!")

I will admit that the above function is probably not the best approach for comparing dates, but it's something that's always worked. Believe me, I've seen much crazier stuff in spreadsheets. We can't control how our customers use the product, and if something like this worked before, we can't break it now. If we changed our behavior so that 39013 no longer represented 10/23/06 and instead represented 10/24/06, that would completely break the function above.

It's true that we could probably write a solution that parses through all functions out there and tries to determine if the number in the function is supposed to represent a date, but that would be seriously buggy, and to be honest it wouldn't be worth it.

Drawbacks?

One way I like to think about this whole issue is to look at this the other way. Let's instead say that the base date system for SpreadsheetML was supposed to start on December 31, 1899. There's just a bug where dates between then and March 1, 1900 are off by one. So while we have this odd behavior for the first two months of the last century, everything after March 1, 1900 is stored correctly. I'm actually only half kidding here... since the date is just stored as in index, there is no real problem caused by this bug, other than maybe looking a bit foolish. In fact, OpenOffice has a behavior vary similar to this. In OpenOffice and Microsoft Excel, the numerical value for 9/13/2006 is equivalent. The numberical values for dates map all the way back to March 1, 1900; and then from that point back they are off by one (so in OpenOffice, 12/31/1899 maps to "1"). So the key thing to notice is that this behavior is fully documented, and there is no reason it should prevent anyone from properly consuming and generating valid SpreadsheetML files.

As I said in some earlier comments, this did come up in the TC45 meetings fairly early on. We didn't spend too much time on it though as it was pretty obvious that while it looked a bit silly, there was no harm. And we all recognized that if we did try to change this, it could seriously change the meaning of existing files.

Remember, this format was not an attempt at building the ultimate generic Office application file format from the ground up. If we did that, and didn't take our customers existing documents into account, then none of our customers would use the format. I think at times people misunderstand the design goals behind the Office Open XML formats, and I'm sorry if I haven't been clear enough on that. The Office Open XML formats were definitely designed with the existing base of legacy Microsoft Office file formats in mind, and it's important to remember that when you look at the format. There are definitely designs in the Open XML formats that clearly show their origins. A key focus of TC45 though was to make sure those behaviors were completely implementable on any other platform. This is covered in more detail in the TC 45 whitepaper.

We not only wanted to create an open format that folks could build solutions on top of, but we wanted the format to be something that our customers would actually use... otherwise what's the point? We didn't want this to just be another optional format that only some people would use, it's the new default format and we hope that all of our customers will use it.

-Brian

Posted Wednesday, October 25, 2006 2:34 AM by BrianJones | 24 Comments

Whitepaper summarizing the Office Open XML standard

Ecma has now published a 14 page whitepaper that does an excellent job of describing the Office Open XML standard and the different goals and challenges TC45 had over the past year while working on the spec. I highly recommend everyone interested in office file formats take a look: http://www.ecma-international.org/news/TC45_current_work/OpenXML%20White%20Paper.pdf

Rather than describe the whitepaper in my own words, I figured I'd just leverage the introduction as it does an excellent job of summarizing the overall purpose of the whitepaper:

Office Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.

The work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).

This white paper summarizes OpenXML. Read it to:

  • Understand the purposes of OpenXML and structure of its Specification
  • Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility
  • Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail

I also really like the second section in the whitepaper titled "Purposes for the Standard," as it helps in dealing with a lot of the questions I've received over the past several months. I think at times folks still aren't clear on the reasons we created this file format in the first place and then passed ownership of it to Ecma international.

OpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.

The original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.

Concurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:

  • generate documents automatically from business data;
  • extract business data from documents and feed those data into business applications;
  • perform restricted tasks that operate on a small subset of a document, yet preserve editability;
  • provide accessibility for user populations with specialized needs, such as the blind; or
  • run on a variety of hardware, including mobile devices.

Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority.

The emergence of these four forces – extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation – have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process.

Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports every feature in the binary formats.

Tom Ngo, the editor of the whitepaper did an excellent job of summarizing what I've tried to convey numerous times in this blog (with varying degrees of success I admit). The industry absolutely needs OpenXML, and we've heard this repeatedly from our customers. It doesn't make it the "file format to replace all file formats," and we would never make such ridiculous claims. Instead it's simply an open standard that helps serve very real customer needs.

-Brian

Posted Tuesday, October 24, 2006 9:41 AM by BrianJones | 3 Comments

More on content controls

I posted earlier this year on the support for custom defined schema in wordprocessingML via the new content controls functionality. The key reason for opening the file formats was to make Office a more valuable platform for solution builders. We wanted folks to have the ability to take Office documents and plug them into new and existing business processes. All the work we've done in Ecma has helped to ensure that people have the necessary information for implementing the new file formats, but that isn't enough. The Office Open XML specification defines the XML for Office documents, but it doesn't define the types of structures that our customers use for their own business data. That's where the support for custom defined schema comes into play. It allows people to take their industry specific data structures (either their own, or one defined by someone else), and apply that structure to their documents so they have that additional semantic meaning. I'd promised to post more information on this but that fell through the cracks (sorry).

Tristan Davis from the Word team is now blogging on content controls and he'll be able to cover it in much greater detail than I could. You should check out his first post up on the Word blog: http://blogs.msdn.com/microsoft_office_word/archive/2006/10/23/control-yourself.aspx

-Brian

Posted Monday, October 23, 2006 11:44 AM by BrianJones | 1 Comments

Filed under: ,

Friday thoughts

I just had a couple random things to mention this afternoon:

  1. ARCast – Office 2007 Open XML Format (Part 1 of 2) - Doug Mahugh and I did a live webcast last month with Ron Jacobs (http://www.skyscrapr.net/blogs/arcasts/archive/2006/10/18/370.aspx). Ron just recently posted an audio recording of the first half of the discussion. Not sure when he'll post the second half…
  2. Convert SpreadsheetML into generic XML - There's a new article up on openxmldeveloper.org that shows how you can convert spreadsheetML into generic XML using an XSLT, and then bind that xml data source to an ASP.NET data grid (which you can then display in the browser).
  3. Performance of XML file formats - I saw these two posts from IBM's Rob Wier discussing different issues around XML file format performance (Celerity of verbosity and Why is OOXML slow?). I love the fox news style of that second post's title BTW ("Democrats want to destroy your family?")… just kidding Rob <g/>. I have to admit that the first post and the second seem a bit contradictory. The first post says that the size of the XML file doesn't affect parse times; and then the second one says that wordprocessingML is slower to parse than ODF because of the larger file size. I admit I haven't had a chance to drill deeper, so I'm sure there is more to it than that (Rob's a performance architect, so I doubt he would miss that).
    I had a post a number of months ago around tag size being an issue in XML parsing times, and I still hold to that. I'll actually try to pull together some numbers to back that up as it sounds like Rob disagrees. There are of course a number of other factors that play a much more important role in the structure of the file format besides tag size (I even mentioned in my original post that tag size itself was a small factor, but still significant enough that we made the decision to use terse tag names on any structure that is likely to repeat often throughout the file). The other issue is that in Rob's experiments he is focusing on WordprocessingML rather than SpreadsheetML. Spreadsheets really are the bigger item in this discussion, as they can have hundreds of millions of XML tags in a single file. In a large wordprocessing document, it's really the text content itself that makes up a lot of the file, and there aren't nearly as many XML tags.
  4. History of richedit - Murray Sargent has a couple great posts discussing richedit and his history with the team (he's been working on it since 1994). The first post discusses the different versions of richedit (http://blogs.msdn.com/murrays/archive/2006/10/14/richedit-versions.aspx). The second post goes into more of the history behind the project and talks about how 3rd parties can leverage it (http://blogs.msdn.com/murrays/archive/2006/10/20/some-richedit-history.aspx).
  5. Standard at Microsoft - Jason Matusow had a couple interesting posts this week. The first posts gives more information on the OSP which is a new approach we recently took towards making various formats freely available to developers (http://blogs.msdn.com/jasonmatusow/archive/2006/10/18/application-of-the-osp.aspx) . The second discussed some of his thinking around interoperability based on his latest trip out to Brussels (http://blogs.msdn.com/jasonmatusow/archive/2006/10/18/interoperability-in-europe.aspx). There are really some important points here around what interoperability really means, and what the most effective ways of building interoperable applications. Custom defined schema support for example is extremely important for allowing Office documents to easily interact with backend data.
  6. Calorie burning drink from coca-cola -Not to sound completely random, but I just saw this and it instantly brought back memories of that old Jim Carrey SNL bit about a hard core diet approach. "Ride the Snake"

Have a great weekend everyone.

-Brian

Posted Friday, October 20, 2006 5:47 PM by BrianJones | 6 Comments

They’re bringing out the big guns

Anyone else been following the latest blog posts from IBM and Sun discussing the Office Open XML formats? It looks like they're stepping up their push to try make ODF the only choice in file formats. I read Tim Bray's post yesterday, but there have actually been a number of other posts folks have pointed out to me as well. Everyone knows that Sun and IBM have a lot riding on ODF financially (they're large corporations, not philanthropies <g/>). It's clear that their plan is to somehow convince governments into mandating just ODF and remove any choice between the two formats.

Thankfully, what you're actually seeing in most places is that governments are asking for 'open formats' in general, not just ODF (contrary to what is usually written in the headlines). Most of those governments understand that Office Open XML is on the verge of becoming an international standard as well and it serves a very important purpose that ODF doesn't. This has raised the alarm bells for IBM and Sun though, and that's why we see the latest smear campaign kicking into gear. It could be that this is more innocent and that instead there is just a lack of technical knowledge. Based on the strong reputations of the folks involved in this campaign though it seems more malicious. I'm saying this after reading their claims that the spec is too complex and therefore not interoperable, which is just ridiculous. Too much information? Every developer I've talked to (even those working for companies that compete directly with Microsoft) is extremely grateful for the amount of information the spec has provided. Look at the 600 developers up on the openxmldeveloper.org site building all kinds of powerful solutions across a number of different platforms (Linux; Mac; Windows).

I think it's pretty ignorant for folks to call this effort a rubber stamp. Talk to the people from Apple, Novell, the British Library, the Library of Congress, Intel, BP, StatOil, Toshiba, Essilor, NextPage, and Microsoft who spent over 200 hours in group discussions around the formats. Look at the results of all the hours that went on in the smaller groups tasked with solving particular problems or those working on the actual documentation that had to go on between the weekly group meetings. The schemas themselves changed significantly and the spec went from 2000 to 6000 pages. Rubber stamp? You must be joking. <g/>

Another thing I've seen from an IBM employee is that he's trying to get more technical by examining the Office Open XML standard looking for minor nits and then attempting to turn them into big issues. That's fine and everyone is entitled to their own opinion. It's kind of funny though that many of the issues he raises are even worse in the ODF spec.

Why would IBM and Sun push for a more limited format?

There is this false claim from some high profile IBM and Sun employees that the Office Open XML spec is not interoperable because it's too big. These statements really help to paint a picture of their strategic interest in ODF. What's the easiest way to compete with another product that has a richer set of features? Get governments to mandate a file format that doesn't support that richer set of features. This way, if the other product (Microsoft Office in this case) has to use the format that was designed for your product, you've just brought them down to your level. It's a brilliant approach, and that shows why there are IBM vice-presidents flying around talking to governments about the need to mandate ODF. It also shows why they want to discredit the Office Open XML format… IBM and Sun feel they have a lot to lose if Office Open XML is standardized, and that's why they've been fighting so strongly in opposition.

Now, contrast that with the Microsoft position, where we've never opposed ODF. We didn't plan on supporting it, but we had no problem with other people using it. The only opposition we've ever had is to policies mandating ODF and blocking Office Open XML. We want choice; IBM and Sun on the other hand absolutely want to block choice. The spin they try to put on this is that by blocking choice in formats they are providing freedom to choose your application… what they don't way though is that we're doing that to an even greater degree. We're sponsoring a free open source project for translating between the two formats, which gives everyone the freedom to choose both the application and the format. Microsoft's view has been that open formats are really important and there is nothing wrong with both ODF and Open XML. IBM and Sun on the other hand want one specific open format (ODF), and that's it.

Now, if you look at it technically, there is no reason to complain about the size of the spec unless you are trying to limit the features supported by the spec. There are plenty of large specifications out there (look at the Java spec) that are completely interoperable. As an implementer of the Office Open XML specification, you are free to decide what pieces you want to implement.

Let's think about this complaint though that the specification is too large. What are the ways in which you could fix that:    

  1. Less documentation and explanation??? - I can't imagine anyone wanting this. Remember, the standard isn't a novel you're supposed to read end to end. It's a detailed description of every piece of the Office Open XML file formats and how it all works. More documentation is an important thing in this case.
  2. Less features??? - Who gains from this? Any implementer has the freedom to pick which part of the spec they want to support. Only applications who want to compete by bringing everyone down to their level would actually want features removed.

There are a lot of features between the three main schemas (WordprocessingML, PresentationML, and SpreadsheetML), and as a result the file format is very large. The ODF spec most likely would have been bigger if they had done a more thorough job documenting it, but even then it still doesn't compare in terms of functionality. One of the other justifications I've heard for the ODF spec being so much smaller is that it reuses other standards. That may account for some, but it still doesn't get you all the way (not even close).

We also looked at reusing other standards where it makes sense (Dublin core, ZIP, XML), but there are plenty of places where that didn't make sense (MathML). Take the example of MathML. It wasn't specifically designed for representing math in a wordprocessing document, but instead math in general. It's a good spec, and it does do a decent job in a wordprocessing document, but it's not able to handle everything that our customers would expect. It doesn't allow for the rich type of formatting and edit history that most customers of a wordprocessing application would want (see Murray's post for more details). Even more interesting though, to date there aren't any ODF wordprocessing applications out there that even support all of MathML. I think that Office 2007 actually has better MathML support with our import/export funcationlity. Another example given is the use of XSL-FO. It's a nice spec to reuse, but it doesn't fully define how international numbering should be done, so as a result OpenOffice has already extended the format in their own proprietary way.

XML itself has only been a standard for about 8 years. For one to assume that all the great thinking and tough problems in the Office document space have already been handled since then is ridiculous.

-Brian

Posted Wednesday, October 18, 2006 6:56 PM by BrianJones | 43 Comments

Simple wordprocessingML document (video demo)

Are the Office Open XML formats too complex? Not really. They are definitely very rich, but the structure of the formats is pretty simple. As you start to get into more complicated features, the complexity of the formats also kicks in though. To show how simple a basic document is though, I like to use this example wordprocessingML document whenever I give a presentation on the formats. I actually posted this example using Beta 1, but the formats have changed a bit so I though it would be worth posting an updated example. (you can view a video of a simpler version of this demo up on Channel 9)

Let's create a really simple document that has three things: a paragraph of text; a hyperlink; and an image.

Part 1 - single paragraph of text

The Office Open XML format is comprised of a number of XML files within a ZIP package. The files follow a simple set of conventions called the open packaging conventions (described in Part 2 of the standard). You need to declare the content types of the parts, as well as tell the consuming application where it should start (via the package relationship), so even you most simple document will have at least 3 files within the ZIP package. Before creating the ZIP file, let's just create a folder somewhere and in that folder create the following files:

  • document.xml
  • [Content_Types].xml
  • _rels/.rels

document.xml

The first thing we need is the actual XML that describes the content of the document. We're going to create three separate paragraphs. The simplest version of this would just have the one paragraph with some text:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World!</w:t>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

_rels/.rels

How does a consuming application know where it should start when opening an OpenXML file? The first place you always look is the package relationships file. The package relationship file will always be located in the "_rels" directory, and it's always called ".rels". We need to create an XML file that tells the consumer that "document.xml" the first place you should go, and that this type of document is an Office Open XML document:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" Target="document.xml"/>
</Relationships>

Notice that a relationship has three main attributes. It has an Id attribute who's use will be more obvious in a bit. The Target attribute tells you where to go, and the path is relative to the parent directory of the "_rels" folder that relationship file is in (in this case that's the root directory). The Type attribute describes what kind of relationship it is (ie what kind of stuff is it pointing at).

[Content_Types].xml

Every Office Open XML file must declare the content types used in the ZIP package. That is done with the [Content_Types].xml file. We currently have two parts in this document that we need to declare content types for. The first is the document.xml part; the second is the _rels/.rels part. So, the content types file should look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>

Notice that we've said every file with the extension ".xml" is of type wordprocessingml document. In a more complex file, that's not going to be the case, and we can instead use overrides for the content types of a specific part, rather than an extension type.

Create version 1 of our simple document

OK, we should now have three files. Two of the files are in the root directory (document.xml & [Content_Types].xml); and in the "_rels" directory we have the ".rels" file. Select the two files and the "_rels" directory and ZIP them up. Make sure that when you zip them up, the two files and the _rels directory are all at the root level.

Open this file in Word, and you now have a simple file.

Part 2 - adding a picture

Now let's add the picture to our document.

document.xml (version 2)

Open the document.xml file and add one more paragraph that specifies the image as follows (make sure you also include the two additional namespace declarations:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World!</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:r>
        <w:pict>
          <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
            <v:imagedata r:id="rId4"/>
          </v:shape>
        </w:pict>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

OK, so we added some XML that describes a picture anchor, but what about the actual picture? That's handled with a relationship. Notice that the v:imagedata tag has the r:id="rId4" attribute. This says that you need to go to the relationships file of the document.xml part and find the relationship with an id of rId4.

_rels/document.xml.rels

We now need to create a new relationships file in the _rels directory. Every part is allowed to have a relationships file. The name for the relationship file is just the name of the original part with a ".rels" added to the end. The file is always placed in the _rels directory which is in the same directory as the part itself. In this case it's "_rels/document.xml.rels"; but if the document.xml file was in a "word" directory (word/document.xml), then the rels file would be "word/_rels/document.xml.rels".

We need to create one relationship with an Id of "rId4" that points to an image, so our _rels/document.xml.rels file should look like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="image1.jpg"/>
</Relationships>

Notice that this time, the relationship type is an image type. It points to a part called image1.jpg so we'll need to create that next.

image1.jpg

You can use any image you want here. I just took this image file and saved it into the root directory with the name "image1.jpg"

[Content_Types].xml (v2)

We've added two new parts to the file. The first part "_rels/document.xml.rels" uses the ".rels" extension which already has the content type declared. The second part though "image1.jpg" uses an extension that we haven't yet declared, so we need to update the content types file as so:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">
  <Default Extension="jpg" ContentType="image/jpeg"/>
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>
</Types>

Create Version 2 of our simple document

OK, now you should have 5 files. In the root directory you have "image1.jpg", "document.xml", "[Content_Types].xml"; and in the "_rels" directory you have ".rels" and "document.xml.rels". Take all those files and ZIP them up. You should now have a document with a simple paragraph followed by a picture.

Part 3 - adding a hyperlink

Now let's add the last piece, which is a hyperlink.

document.xml (version 3)

Here, we need to add one final paragraph. This paragraph will contain a run of text that has the hyperlink tag around it:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:wordDocument xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r>
        <w:t>Hello World!</w:t>
      </w:r>
    </w:p>
    <w:p>
      <w:hyperlink r:id="rId2">
        <w:r>
          <w:rPr>
            <w:color w:val="0000FF" w:themeColor="hyperlink" />
            <w:u w:val="single" />
          </w:rPr>
          <w:t>Click here for Brian Jones' blog.</w:t>
        </w:r>
      </w:hyperlink>
    </w:p>
    <w:p>
      <w:r>
        <w:pict>
          <v:shape id="_x0000_i1025" type="#_x0000_t75" style="width:250; height:200">
            <v:imagedata r:id="rId4"/>
          </v:shape>
        </w:pict>
      </w:r>
    </w:p>
  </w:body>
</w:wordDocument>

Notice again that rather than reference the resource directly, we use a relationship. With the image, we used the relationship with an id of "rId4", and with the hyperlink we're going to use the relationship of "rId2". This is a general rule followed throughout most of the Office Open XML standard. Any reference to a resource (whether it be directly within the ZIP package, or it be an outside file) is done with relationships. This way you can do a quick scan of any document and quickly understand all the components that make up that file. It also makes link fix-up or even cleansing significantly easier. You never need to read through the actual content XML pieces and can instead make you modifications directly to the lightweight relationship parts.

_rels/document.xml.rels (v3)

As described above, we need to create a relationship to the URL of the hyperlink we created. In this case, we'll not only use the three core attributes for the relationship (Id, Type, Target), but we'll also use a fourth attribute specifying that this relationship points to an external resource (ie something that lives outside of the ZIP package).

So, you should make the following modifications to our _rels/document.xml.rels part:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId2" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="http://blogs.msdn.com/brian_jones" TargetMode="External" />
  <Relationship Id="rId4" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="image1.jpg"/>
</Relationships>

Create Version 3 of our simple document

OK, so the V3 pass on this didn't add any additional files to the package, so just take the same collection of files from Part 2, and ZIP them up. Open the file in Word and you should see a simple paragraph; a hyperlink; and a picture. Viola!

-Brian

Posted Monday, October 16, 2006 4:56 PM by BrianJones | 7 Comments

Filed under: ,

Comparison of OpenXML math and MathML

Murray Sargent who was the architect of the new math functionality in Office 2007 has a blog post where he talks more about the design of the Math schema in the Office Open XML formats (http://blogs.msdn.com/murrays/archive/2006/10/07/MathML-and-Ecma-Math-_2800_OMML_2900_-.aspx). Murray has also been active in the MathML community and is even joining the MathML 3.0 working group.

I gave some background information on why we made the decision to support MathML, but not to use it in the Office Open XML formats. We support MathML on the clipboard, and also have XSLTs that go both ways (from MathML into Office Open XML and from Office Open XML to MathML). Murray goes into much greater detail on this in his post. Here is a small snippet from his post:

Naturally there's been a lot of discussion as to why we even have OMML, since MathML is really good. Brian Jones has addressed that issue in some detail in his Open XML Formats blog. The main problem is that Word needs to allow users to embed arbitrary scan-level material (basically anything you can put into a Word paragraph) in math zones and MathML is geared toward allowing only math in math zones. A subsidiary consideration is the desire to have an XML that corresponds closely to the internal format, aiding performance and offering readily achievable robustness. Since both MathML and OMML are XMLs, XSLTs can (and have) been created to convert one into the other. So it seems you can have your cake and eat it too. Thank you XML!

-Brian

Posted Thursday, October 12, 2006 6:58 PM by BrianJones | 9 Comments

Filed under:

Office Open XML diffing tool available

I just saw this on Doug Mahugh's blog and it's really cool. Stephane Rodriguez has built a tool that allows you to view a diff of two Open XML files. I was actually bugging a number of folks to see if they would build something like this and it's awesome to see someone outside of Microsoft stepping up and pulling this together.

I've seen more and more momentum from folks in the community building solutions and now that the standard has solidified and is in it's final draft form I think we'll see a huge increase developers taking advantage of the new formats. If you have a solution you've built, I'd love to hear about it. Feel free to send it on via the comments functionality on the blog or you can click here to e-mail me directly.

The best site though for collaborating and sharing ideas around the Open XML formats is of course the openxmldeveloper.org community. It's pretty cool to see all the folks that have already delved into the earlier drafts of the standard and have positioned themselves to be the leading experts in what will become one of the most widely used file formats in the world.

-Brian

Posted Wednesday, October 11, 2006 4:29 PM by BrianJones | 4 Comments

Office Open XML final draft!!!

As I already mentioned, in the last face to face meeting in Trondheim, Norway we unanimously voted to approve the final draft of the Office Open XML spec as ready to submit to the Ecma General Assembly. The GA will then review the draft and in December there will be a vote to approve it as an Ecma Standard!

This is a huge milestone, and the entire technical committee has worked extremely hard over the past year. We really had an amazing collection of contributors to this standard, and if you take a look, it will show: http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50_final_draft.htm

For those of you interested, here is the list of all the organizations contributing to the standard:

  • Apple
  • Barclays Capital
  • BP
  • The British Library
  • Essilor
  • Intel
  • Microsoft
  • NextPage
  • Novell
  • Statoil
  • Toshiba
  • The United States Library of Congress

We posted the draft in three separate formats. There is a PDF version; a tagged PDF version (for accessibility); and a DOCX version.

The final draft is still broken out into 5 separate parts:

  1. Fundamentals – gives an overview of the structure of the formats, and describes all allowed parts; content types; and relationship types.
  2. Open Packaging Conventions – describes the basic conventions used for storing the parts of the file within a ZIP package. *
  3. Primer – gives a great description of all the markup languages and how they work. This serves as a great tutorial.
  4. Markup Language Reference – contains detailed descriptions of each and every element; attribute; and simple type. Serves as a great reference when you want to look up what an element means. **
  5. Markup Compatibility and Extensibility – describes how additional markup can be added to the format while still conforming to the spec

* Part 2 has a couple additional electronic resources. There are a few XSD files, as well as the equivalent RelaxNG files (we were lucky enough to have Rick Jelliffe help in the creation of these).

** Part 4 has a collection of XSD files and the equivalent informative RelaxNG files. There are also a collection of predefined cell and table style references for spreadsheetML, as well as a collection of predefined shape and text warp geometries for drawingML.

I've been giving pretty frequent updates on the progress of the spec, so most of the content at this point won't come as a surprise. We spent the last few weeks in the committee nailing down any potential interoperability issues, which included a new schema that allows applications to clearly define additional characteristics that may assist consumers in better handling their files. For example, it's possible to define what level of arithmetic precision was used for Spreadsheet formula calculations, so that a consuming application can accurately display the same results.

We're already seeing hundreds of developers working with the earlier versions of the draft, and this final version will really help everyone who's been waiting for it to solidify. If you go over to the openxmldeveloper.org site, you'll see there are almost 600 registered members and an extremely active discussion forum. There's also talk of starting up a blogging collection so that the members can actively blog about the solutions they are building. It's exciting seeing the diverse set of solutions; from document assembly on a linux box, to mind manager solutions that output wordprocessingML.

I'm already getting excited for what we do with version 2 of the spec (but I could use a little break between now and then). Here are a few fun facts about the work that's gone on over the past year:

  • 72 presentations were given to the technical committee explaining the existing behaviors of features so that discussion on how to best structure and document it could then take place.
  • 66 hours of live meeting discussions (starting at 6am every Thursday for those of us on the west coast of the US)
  • 88 schema files
  • 128 hours of face to face meetings held in Brussels (ECMA); Cupertino, CA (Apple); London (British Library); Sapporro, Japan (Toshiba); Redmond, WA (Microsoft); Trondheim, Norway (StatOil)
  • 6,000 pages of documentation between the 5 parts of the standard
  • 9,422 different items to document (3,114 attributes, 2,500 element, 3,243 enumeration, 567 simple types)

-Brian    

Posted Monday, October 09, 2006 10:23 AM by BrianJones | 22 Comments

Random Friday thoughts (from 30,000 feet)

Well I'm currently flying over Canada on my way back from Trondheim, Norway and the most recent Ecma face to face meetings hosted by StatOil. I was worried that the wireless internet support wouldn't be available but I guess the connexion service isn't going to stop until the end of the year. It's a bummer they aren't going to keep it going, but at least I can still take advantage of it for this flight (there's something really cool about having the ability to IM with my wife while I'm flying home).

We accomplished an incredible amount during the past 3 days of face to face meetings. There should be a status report sometime next week that explains all that was accomplished and what the state of the specification is. I'm really excited though about where we are and what we've accomplished over the past year.

Hell and back

The airport that you fly into for Trondheim is actually in a small town called "Hell," which of course led to a number of jokes during the week (obviously I couldn't resist either). Trondheim was a beautiful city about midway up the coast of Norway. I'm about a quarter Norwegian and it was my first chance to get back to the homeland. I lived in Iceland for a couple years, and have visited Sweden, but never made it over to Norway. Although I was only there for a short period of time, I was really impressed. Here are a couple pictures I took while walking down along the waterfront:

And here are a couple pictures I took on a tour we took of an old Norwegian village/farm:

Intel and Microsoft Partner to enable Office Open XML formats as key to sharing business information

Not sure if you've seen the news, but Intel and Microsoft are partnering to focus on integration of RosettaNet and the Office Open XML formats. Here's another great example of the power you get with the open formats. This is something that would have been nearly impossible to pull off with the legacy binary formats, but now that we've moved to XML a whole world of possibilities is now available. RosettaNet is all about sharing business data, and the Open XML formats help to unlock all that valuable data that exists in the billions of Office documents.

Murray Sargent's Blog

Murray Sargent, who I've already mentioned as being the key architect behind the new math support in Office 2007 recently posted about how he got into Technical Wordprocessing. As you can see, he has a huge amount of experience in the space, and the Office products have benefited greatly from his work for years. Here's a bit about how he got started:

When I finished my PhD in 1967, I went to Bell Labs to continue working on laser physics and after a year got seduced by the idea of labeling graphs with real built-up, i.e., 2D, mathematical expressions. To this end, I created the SCROLL language (string and character recording oriented logogrammatic language), which was the first language capable of "typesetting" mathematical equations on a computer. I published it in AFIPS Conf. Proc. 35:525-536, AFIPS Press, Montvale, N.J. (1970). Admittedly SCROLL's typography was pretty limited. For example, the user had the responsibility of spacing the math, in contrast with TeX, Word 2007, and other sophisticated systems. But it was the first program capable of displaying built-up math, and it was really fine for that time to be able to show nicely labeled results at various conferences.

Open XML formats in older versions of Office

Jon Bailor recently posted on the Word team blog site discussing the free updates that are provided to allow users of previous versions of Office to open and save files in the new Office Open XML formats. This was a huge effort, and we had a number of folks working specifically on this functionality. It's always exciting to see resources spent on an effort like this because it was so clearly the right thing to do. The file formats make Office a more valuable platform, and because of the free updates this is also the case for the existing base of Microsoft Office customers.

Zeyad Rajabi also has a post up on the Word team blog site about file format compatibility in Word 2007. He talks about the work that was done to make the migration to the new formats as easy as possible via compatibility modes.

Bulk Upgrade existing set of binary Office document into Office Open XML

I've had folks ask about this so I figured I'd link to some more information on the bulk upgrade tools that are going to be provided to allow migration from the old binary formats into the new Open XML formats. It's part of a larger tool called the OMPM (Office Migration Planning Manager) which allows you to quickly scan your networks for Office files and see which ones may need more attention during the upgrade process (ie files with VBA, etc.). http://technet2.microsoft.com/Office/en-us/library/d0373697-31f5-4fc5-8dd1-1b9d7f35842f1033.mspx?mfr=true

American Football

Well, I was unfortunately out of town for one of the best weekends of football in Seattle in quite some time. I was able to watch some of the Husky game from the airport and I'm really regretting missing that one. It was originally supposed to be at noon so I could have gone to the game and still made my flight, but they pushed it back to 4 or 5 so they could get it on TV. I was able to watch the Seahawk game from my hotel in Trondheim. There's a pretty cool service over the internet where as long as you are outside of the US, you can pay a fee and watch a streaming broadcast of the game. So, while I wasn't able to go to either game, I at least was able to follow along remotely.

-Brian

Posted Friday, September 29, 2006 2:54 PM by BrianJones | 5 Comments

Filed under:

More Posts Next page »