
I was wondering if someone more familiar with the ODF spec could help me out. I had made the following reply yesterday to Alex's assertion that ODF was as feature rich as Open XML and I want to make sure that I'm not misstating things:

I think you might want to dig a bit deeper into the formats. ODF does build on existing industry standards, but at times those are only partial implementations, and it still leaves out a lot. For instance, Open XML actually uses more of the Dublin Core metadata schema than ODF does.
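
To make the metadata point concrete, here is a rough sketch of the core properties part (docProps/core.xml) that an Open XML package carries. The values are made up for illustration and you should check the exact namespaces against the Ecma draft, but the mix of Dublin Core elements with a few package-level ones is the point:

<cp:coreProperties
    xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Quarterly report</dc:title>
  <dc:creator>Brian</dc:creator>
  <dc:subject>File formats</dc:subject>
  <cp:keywords>Open XML; metadata</cp:keywords>
  <dcterms:created xsi:type="dcterms:W3CDTF">2006-05-24T09:00:00Z</dcterms:created>
</cp:coreProperties>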

Another easy example would be to look at the different types of numbering for a wordprocessing file. In Microsoft Office you can say that the numbered list should be "first", "second" and "third" instead of  "1.", "2." and "3.".  ODF doesn't support that.
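
To make that concrete, here's roughly how it shows up in a WordprocessingML numbering definition. This is a sketch based on my reading of the draft's numbering section rather than a snippet pulled from a real file, so treat the details as illustrative:

<w:abstractNum w:abstractNumId="0">
  <w:lvl w:ilvl="0">
    <w:start w:val="1"/>
    <!-- produces "First", "Second", "Third" instead of "1.", "2.", "3." -->
    <w:numFmt w:val="ordinalText"/>
    <w:lvlText w:val="%1"/>
    <w:lvlJc w:val="left"/>
  </w:lvl>
</w:abstractNum>

The numFmt value is what drives the ordinal text; swapping it for a value like "decimal" or "lowerRoman" gives you the more familiar styles.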

That's just the beginning though. If you are from another country like Japan or China, there is absolutely *zero* mention for how your numbering types are defined. The spec only specifies:

  - Numeric: 1, 2, 3, ...  
  - Alphabetic: a, b, c, ... or A, B, C, ...  
  - Roman: i, ii, iii, iv, ... or I, II, III, IV,...

No mention at all about what you do for any other language. If you use OpenOffice, they actually do support other languages, and they even save out those other numbering formats into the ODF  style:num-format attribute. The problem though is that behavior isn't defined in the spec, so how does anyone else that wants to read that document figure out what OpenOffice's extension means? Maybe I'm just missing something, as the ODF spec is really vague in a lot of areas, but I looked around for awhile and couldn't find anything.

Even if you don't pay attention to the things that are just flat-out missing from the format, the documentation for the things it does support is pretty minimal. In the latest Ecma draft, we have about 200 pages discussing the syntax of formulas for spreadsheets, ODF has a few lines. That gives me the impression that no one that does accounting or works on Wall Street was involved in the standard because I can't really imagine them allowing it to go through without specifying how formulas should be represented. It's no wonder the few applications referenced as being "full implementations" of ODF aren't even capable of full interoperability (link).

Alex then replied with the following:

OpenDocument is well known to support a variety of languages, and the Japanese ISO member pointed out a couple of problems with the spec (mostly to do with international URIs). I think they would have noticed if numbering was a problem. The guys in the Middle East were looking at it too.

You're absolutely right about formulas; OpenDocument does not specify a syntax, and that is something the TC is working on. There is a wider problem here, though: formula syntax is something users know directly. Should OpenDocument do something new, or just what Lotus 1-2-3/Excel did/do? OXML has the luxury of only caring about compatibility with Office file formats; OpenDocument is designed to be widely compatible with all.

I may have jumped the gun when I stated that there was *zero* documentation, but I'm curious to know where in the ODF spec these things are specified. When I looked at the numbering sections (4.3, 12.2.2, 14.10.2) they were pretty light, and only called out those three styles I mentioned above. In section 12.2.2 there is a reference to the approach used in XSLT for the format attribute, but it just says the attribute is done in the same way, not that the same actual formats are supported. The spec then states that it only supports a specific set, and that it does not support all the different types the XSLT approach uses. The spec says that the number styles supported are ("1", "a", "A", "i", and "I"). Let's assume though that the spec was just worded improperly and it does in fact use the XSLT format approach to the full extent. Then why does OpenOffice output the Japanese numbering format like this:
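
Roughly, the list style OpenOffice writes out looks something like the following; I'm reconstructing the value from memory of its output rather than pasting it, so treat the element name and the exact characters as approximate:

<text:list-level-style-number text:level="1"
    style:num-suffix="."
    style:num-format="一, 二, 三, ..."/>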

The XSLT spec says that you only put the first character of the list in the format attribute (or at least that's how I interpret it). I didn't see any mention of the approach of putting the first three characters followed by an ellipsis.

That was using Kanji numbering. The XSLT spec actually does call out directly how to do Katakana numbering, and OpenOffice doesn't do that properly either (the XSLT spec says it should be format="ア"). Instead, OpenOffice does this:
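
Again, this is a reconstruction rather than a paste of its output, so take the exact value as approximate; it's the pattern that matters:

<text:list-level-style-number text:level="1"
    style:num-suffix="."
    style:num-format="ア, イ, ウ, ..."/>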

Now, for those familiar with Japanese numbers (and actually a whole host of other number styles) you know that it isn't always possible to represent a numbering style with just a single character. There are a couple different Kanji numbering styles that start with the same character (the difference is what you do once you get to 10). I assume that's why OpenOffice is going the route that it is.

Where is this approach documented though? Maybe I'm just misreading things here, and there is another portion of the ODF spec or the XSLT spec that allows for that approach? Or does this mean that if you are writing a Japanese document and use numbered lists with OpenOffice you aren't creating a valid ODF file? This newsgroup post implies that OpenOffice isn't yet fully supporting ODF so maybe that's the case? I suppose the response could just be that the format is extensible and you can place anything you want in that attribute, but how does that lead to interoperability? There's nothing to tell other applications how they should interpret that value, as far as I can tell (again, I could be missing something obvious).

Almost every site I visit to find more information focuses almost completely on the marketing or political side of ODF. There are discussions around conformance, logo compliance, getting governments to support it, etc. etc. etc. I'm having a really hard time finding any good blogs or sites that discuss how to actually use it. I actually came across the OASIS public mailing list archives, which had some useful content, but I wasn't able to find anything about this issue.

-Brian


I saw this article up on betanews this afternoon saying that just 24 hours after going live, there were already 200,000 downloads of Beta 2. I thought that was pretty impressive and asked a few folks about it. It turns out that we're actually now over 500,000 and the curve is actually ramping up, and not flattening. Who knows where we'll be by next week. This is awesome!

The first draft of the Open XML spec is published; over a half million people have already downloaded the product that uses Open XML by default; and a brand new developer community is forming. This is your chance to be one of the first to work with what will most likely be one of the most widespread file formats in history. Let's get these discussions going over at openxmldeveloper.org. There are about 300 or so members right now, and that was before we even had the beta out. Share your solution ideas, and find out what other people are doing. I'm looking forward to the discussions... most of my friends couldn't care less about file formats, but thankfully I have you guys. :-)

-Brian


There is now a really easy way to provide feedback as you are using Office 2007. Directly from Joe Friend's blog:

Sending Feedback

We have a great program called "Send a Smile" that provides you the opportunity to tell us what you think about the product. Don't worry, you can "Send a Frown" too! Just download the SaS application and install it; then, as you are using the product and experience something you love/hate, you can simply click on the smile or frown in the notification area at the bottom right of your screen:

Next you'll see a dialog that includes a screenshot of the current application in use and a text entry box to include your comment. You can choose not to send the screenshot if you so wish or you can update it to make sure it shows the issue you're writing about. The more details you put in your comment the better. Click on the image below to see what this dialog looks like.

What happens to this information once you submit it?

The information (your comment and screenshot) is forwarded on to the Office team. We evaluate the comment/screenshot and make sure it gets routed to the correct team. Then we evaluate the information in order to determine if there are changes we need to make to the product. FWIW, we expect to get a lot of feedback, so we can't respond directly to most of it, but we will be reading it as fast as we can.

So, download Office, load SaS and tell us what you're thinking!

-Brian


Hard to believe it's only been a year since I first started blogging... so much has happened. It's really great that we went out so early with the news of the file format change; we started talking about it an entire year before we shipped a public beta. I think that helped to get a lot of the questions answered ahead of time, which is really important.

Speaking of the Beta, you may have noticed that the servers are really bogged down right now. A number of folks have had trouble getting the product keys. We are definitely aware of the problem, and there are a lot of people out here working to up the capacity. Sorry for the inconvenience.

-Brian


There has been a great overall reaction to the news last week of Ecma's first public draft for the Office Open XML formats. One thing that is now absolutely clear to everyone is that we are talking about an extremely rich and powerful set of file formats.

I think many folks didn't realize the amount of work we've had to take on, which explains why some had the false assumption that we could just use ODF. We were pretty clear in our response that it just wouldn't work for our customers because at the end of the day an open format is useless if the majority of our customers won't use it. That's why we had to make our formats fully support all the existing Microsoft Office files out there. If the formats didn't support all those features, then the only people who would use them are those that fundamentally want an open format; and everyone else would have just stuck with the old binaries. We absolutely did not want that to happen, we wanted everyone using an open format. We've invested a ton of resources into XML file formats because we believe it's a good thing, and we need to make sure that our customers will be willing to use them.

Let me be clear on a couple key points:

  1. Rich format - Yes the format is extremely rich and deep and that's because it represents a very powerful set of applications that have evolved over many years and many documents. It would have been completely unacceptable for us to create a format that didn't fully represent the existing base of Microsoft Office functionality. If we had created some kind of subset format, many in the industry would have complained for very legitimate reasons. People would have complained that we were destroying fidelity with the key features they used, that we were hiding functionality, not enabling everyone to exploit the rich features, not encouraging the move to XML, etc. Bottom line – millions of organizations would have had a legitimate problem.
  2. Extremely detailed documentation - It's funny but I've actually seen people complaining that there is too much documentation. The documentation is essential, even if there are parts that are not used by everyone. I personally think we have to provide documentation on every aspect of our format, otherwise how do you know what something means? This is a lot of work, and I believe it's absolutely necessary. I can't imagine there being a benefit to anyone from not documenting something.
  3. Full implementation - I don't think it should come as a surprise that with the rich set of features in Office, it's going to be a lot of work to build an application that can support all of that functionality. In the past, people had said that the reason nobody could build an application that matched Office was that the formats were locked up. Well, the format information was available, but not for all the many purposes that we are enabling now. Now all those people should be happy because the format information is complete enough to enable a full understanding by everyone. It's up to those other applications though to decide what level of support they want to build. While I think interoperability is possible, the struggles that the applications supporting ODF are having show that it's really a lot of work even for a format that isn't as deep. This is often to be expected though because the different applications have different sets of features as well as different implementations of the same features. That is how things work.
  4. Partial implementation - Now, if you don't care about fully matching Office features, then anyone can choose to just support a subset of the format. You can implement as much or as little of the format as you want. You can build an application that does nothing more than adds comments to a WordprocessingML document; or that automatically generates a rich SpreadsheetML file based on business data. It's up to you. The information is all there to use in the way that best benefits your application.
  5. Room for innovation - Now that all the features we've stored in our formats are fully open and documented, people are free to build with them. In addition to the fact that you can implement as much or as little of the format as you want, you are also free to add more to the format. The formats are completely extensible. You can add your own extensions to the format, or you can even join Ecma and propose that those extensions get added to the official Ecma standard. The strong support for custom defined schema in Office gives you a lot more power than what a document format on its own would give you, through integration of your own parts.
  6. Microsoft does not own the standard - We no longer own these formats, Ecma does. I know there is still concern out there that these formats could change out from under you, but that's not something that Microsoft can do. Ecma fully controls it, and once it goes through ISO, it will be even more solid and locked down.

I'd also like to reuse some information that I left as a comment in my post last week. Some people were a bit confused about how you could create a standard that was so rich and had all the backward compatibility with the existing base of Microsoft Office documents. It was even suggested that, almost as a way of leveling the playing field, we should choose just a subset of features that we think everyone can build applications for. This would be a great move for our competitors but a horrible move for our customers. Adam provided a lot of feedback and I really appreciated that he took the time to write all that up. Patrick and Biff had some really great replies that tried to explain why backward compatibility was so important. Here is the reply I left for Adam that I hope helps really clear up his questions around why we went the standardization route in the first place:

Hey Adam, thanks for taking the time to get all your thoughts down. It definitely has helped me understand where you are coming from.

It sounds like you understand that from our point of view, in order to use an XML format as the *default* format for Office, it needs to be 100% compatible, right? I think your point is more that we should also have an optional format that is more basic and doesn't necessarily have 100% of the features covered. That smaller, more basic format would then be the one that should be standardized. I think that's what you are saying.

Based on your description, the format you desire sounds a lot like HTML. HTML is a great format for basic interchange. It doesn't support everything that is present in an Office document, but as you said, that isn't always desirable. We've supported HTML for quite a while, although we took the approach of trying to have our cake and eat it too when we attempted to make our HTML output support 100% of our features. The result was an HTML format that had a ton of extra stuff in it that many of the people who just wanted HTML didn't really care about (and it just got in the way).

Our primary goal this release with the formats was not to try and re-implement HTML, but instead to move everyone over to using XML for all of their documents. Let's talk about the motivations for what we are doing with Open XML since that was the main point of your question:

  1. The reason we've spent the past 8 or so years moving our formats toward a default XML format is that we wanted to improve the value and significance of Office documents. We wanted Office documents to play an important role in business processes where they couldn't before. We wanted to make it easier for developers to build solutions that produce and consume Office documents. There are other advantages too, but the main thing is that Office documents are much more valuable in just about every way when they are open and accessible.
  2. The reason we fully document them is the exact same. We need developers to understand how to program against them. Without the full documentation, we don't achieve any of the goals I stated above. The only benefit would be that other Microsoft products could potentially interact with the documents better (like SQL or SharePoint), but that doesn't give us the broad exposure we want. That would be selling ourselves short. We want as many solutions/platforms/developers/products as possible to be able to work with our files.
  3. The reason we moved to the "Covenant not to sue" was that a number of people out there were concerned that our royalty free license approach wasn't compatible with open source licenses. Again, since the whole reason for opening the files was to broaden the scenarios and solutions where Office documents could play a role, we moved to the CNS so that we could integrate with that many more systems. Initially we'd thought the royalty free license just about covered it, but there was enough public concern out there that we decided we needed to make it even more basic and straightforward. We committed to not enforce any of our IP in the formats against anyone, as long as they didn't try to enforce IP against us in the same area. No license needed, no attribution, we just made a legal commitment.
  4. The reason we've taken the formats to Ecma for standardization is that it appeared that a number of potential solution builders were concerned that if we owned the formats and had full control, we could change them on a whim and break their solutions. We also had significant requests from governments that wanted to make sure that the formats were standardized and no longer owned by Microsoft. Long term archive-ability was really important and they wanted to know that even if Microsoft went away, there would still be access to the formats. We were already planning on fully documenting them, but the Ecma standardization process gave us the advantage of going through a well established formal process for ensuring that the formats are fully interoperable and fully documented. It's drawn a lot more attention to the documentation as well, so I'm sure we'll get much better input, even from folks who aren't participating directly in the process.

I hope that helps to clear it up a bit. It really is just as simple as that. Any application is free to implement as little or as much of the format as they wish. If you really want every application operating on a more limited set of features, that isn't as much of a format thing as an application thing. You would need to get every application to agree that it will not add any new features or functionality, and will disable any existing functionality that the other applications don't have. That wasn't our goal. Our goal was to open up all the existing documents out there, and then anyone who wants to build solutions around those formats is free to do so. In addition, anyone is free to innovate on top of the formats, as I believe there is still a lot of innovation to come. The formats are completely extensible, so if someone wants to use the formats (or parts of the formats) as a base and build on top of that, they can do so as well. They can even join Ecma if they want and propose to add those new extensions to the next version of the standard.

-Brian


A coworker of mine just reminded me that for those folks who are interested in using the new Open XML formats but don't want to upgrade to Office 2007 Beta 2, we also released a public preview of the compatibility packs (thanks Jon!).

This is one of the other really important points I've tried to get across over the past year. These new formats aren't just for people who have Office 2007. You don't need to upgrade in order to use them. You can use these compatibility packs with your existing copies of Office, and you'll be able to read and write the Open XML formats.

-Brian

I just ran across this link today and was wondering if anyone else had checked it out: http://shudson310.blogspot.com/2006/05/docbook-xsl-170-released.html

It looks like there is a docbook project up on sourceforge that has XSLT support for going from WordprocessingML into DocBook (or the other way around). I haven't had a chance to look into it yet... sounds pretty cool though.

-Brian


This is shaping up to be a pretty cool month. Last week we finally had the first public draft of Ecma's Office Open XML format standard released. And today, the first public beta of Office 2007 has been released.

I'd really encourage everyone to head up there and download a copy. It will give you a great chance to start working with the file formats and building solutions. Between the openxmldeveloper.org community and the Ecma draft, there should be a ton of information available to help you get started. Also my blog will continue to focus on the file formats for a long time to come, so you can always check back here too.

-Brian


Today we have a guest writer to discuss the HTML output that we have in the new blogging functionality for Word 2007. His name is Zeyad Rajabi and he's a program manager on the Word team. Zeyad works on file format related issues, including the HTML support in Word. All of Zeyad's posts will be under the "Word HTML" category if you are interested in tracking those separately.

As some of you may know from Joe Friend’s blog, Word 2007 will allow users to author blogs straight from Word. I want to follow up on Joe’s blog by giving you guys more details concerning our XHTML output for the blogging feature. I hope to use this blog as an opportunity for you to comment on our blogging XHTML output and to make any suggestions.

Goals

Before I get into details about our XHTML output, I want to outline the goals for our blogging feature. The design goals behind the XHTML output from the blog tool are significantly different from what we’ve done in the past:

  • Output XHTML compliant code for each post (we are following the W3C spec)
  • Output clean and readable XHTML

Instead of concentrating on supporting 100% of Word’s features (as we did in the past) the blog feature will support a much smaller set of features and additionally concentrate on outputting clean and readable XHTML. The blog feature will only output the necessary XHTML needed to represent the document. No more redundant HTML or CSS. No more Microsoft Office specific CSS properties. We will output just clean and easy to read XHTML.
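
To give a feel for what "clean and readable" means in practice, the markup for a simple paragraph should come out as not much more than this (an illustrative sketch rather than actual tool output):

<p>Here is a <strong>short</strong> paragraph with a
<a href="http://www.example.com" title="Tip">hyperlink</a> and some
<em>emphasis</em> in it, and nothing else.</p>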

Known Beta 2 Issues

There are still some known bugs in the XHTML output for Beta 2. I wanted to point them out so that you aren’t surprised:

  • Strikethrough - We are outputting CSS property text-decoration for strikethrough instead of <del>
  • Divs around lists - We are outputting div tags for every list item. We do not need to output these extra elements
  • Block level elements within inline elements - We are not XHTML compliant in some cases because we are not following proper tag content flow. We are outputting block-level elements inside inline elements.
  • Multi-level lists - We are incorrectly outputting multi-level lists in terms of being XHTML compliant: we are closing the parent list before its sub lists are closed (see the sketch after this list)
  • Table bloat - Our XHTML output for tables is too heavy and contains too much redundancy
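
For the multi-level list issue, the fix is to keep the sub list nested inside its parent <li> instead of closing the parent list first. Here is a sketch of the nesting we should be producing (illustrative markup, not our current output):

<ul>
  <li>First item
    <ul>
      <li>Sub item</li>
    </ul>
  </li>
  <li>Second item</li>
</ul>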

I am sure there are more bugs to be found and I’m sure you guys will help me add to the list! As you play with the blogging feature, please feel free to send me any questions or suggestions you have. I want to make this feature great for all of us.

XHTML Output

There is too much to discuss in this first post, so I think I’ll break down the XHTML output into multiple categories: formatting, styles, lists, images and tables. I’ll have a separate post for each category so we can have some more targeted discussions. Another thing that I was thinking about doing was pulling all of this together as a public spec that I can post. Again, I would love for you to send me suggestions on any or all of the categories.

For those interested, take a look at the source code the blogging tool generated for this post (note that it's only the contents that would go inside the <body> tag).

Formatting

Today let’s look at some details around the XHTML we output for formatting features:

  • Hyperlinks - <a href="http://www.foo.com" target="_blank" title="Tip">hyperlink</a>
  • Font - <span style="font-family:XXXXXX;">text</span>
  • Font Size - <span style="font-size:28pt">text</span>
  • Font Color - <span style="color:XXXXXX">Colored text</span>
  • Bold - <strong>text</strong>
  • Italic - <em>text</em>
  • Underline - <u>text</u>
  • Strikethrough - <del>text</del>
  • Highlighter - <span style="background-color:XXXXXX">text</span>
  • Alignment - <p align="left">text</p>, <p align="right">text</p>, <p align="center">text</p>
  • Indent - <blockquote>text</blockquote>

Suggestions are Welcome

I know there are a couple different approaches for all of these. If you disagree with our approach let me know. I’ve read a lot of differing opinions on some of these (especially indentation), so while we probably won’t get everyone to agree 100% on the approach, hopefully we can find the best approach.

Anything missing? Is there a better way of representing a feature in XHTML?

Wow, we finally have an updated draft of the Ecma Office Open XML formats standard! http://www.ecma-international.org/news/TC45_current_work/TC45-2006-50.htm I've been waiting for a long time to be able to share all the great work that's been going on in Ecma TC45, and it's so awesome that we have a new public draft. I can't wait to hear what everyone thinks. If you go to that site, you'll see three different downloads:

  1. Draft 1.3 of the spec - The big download is the spec itself in PDF form. It's about 25 megabytes and is around 4000 pages.
  2. Draft 1.3 of the spec in the Open XML format - Alternatively, you can download the .docx version of the spec. Once Beta 2 comes out, you can open it that way (although opening 4000 pages of content with beta software may be slightly problematic <g/>)
  3. Schemas - The schema files are also available for download. They are available in a ZIP file that also contains an index.htm file describing each xsd.

We've been working really hard over the past 5 months bringing this standard along. There is still a lot of work to do, but you'll see pretty clearly that we've made a ton of progress over the initial submission from last year. We have weekly 2 hour phone conferences (they are actually at 6am my time, which is not ideal <g/>), as well as 3 day face to face meetings about every 2 months. The contributions from everyone have just been outstanding. It's so awesome to work with such a diverse group of people. While the initial submission was made by Microsoft, it's now completely in Ecma's control and we've had a lot of help from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, and Toshiba.

***Note*** Remember that this is just a draft. Some sections of the spec are much further along than others, so keep that in mind while you are looking through the spec. If you are in an area that looks like there isn't much information, odds are we just haven't gotten to that yet.

While I'm sure we'll be able to spend the next several months talking about all this, some of the big things I wanted to point out are:

  1. Public feedback - While the Ecma organization is completely open and anyone can join, I understand that some people just aren't able to make that commitment. That's why I was really excited that we have a mechanism set up now so that anyone can give feedback on the spec: ecmatc45feedback@ecma-international.org
  2. Technical discussion - If you are looking for technical discussions around the formats, you can also go to the openxmldeveloper.org site where there is a forum for a wide range of technical issues for developers who want to implement the formats.
  3. Navigating the PDF - The PDF file was actually generated using Word 2007. Bring up the Bookmark pane and you can easily navigate through the document structure (it's over 4000 pages, so that helps a lot!). You will also notice that in the reference sections, you can easily navigate through element and type reference just by clicking on the section number next to the element or type's name.
  4. Spreadsheet Formulas - Check out 15.5 (starts on page 247). There are about 160 pages of content describing the formula syntax and about 360 different functions. You'll notice that there is still a ways to go, but this is already a huge amount of really useful information.
  5. Depth of documentation - I know we've said this a million times, but this is a huge project. Migrating all the existing Office documents into an Open XML format and then providing full documentation is a ton of work. Many people don't realize how large these applications are, and how much there really is to cover. If you want an example, download the spec and look at the documentation for the simple type "ST_Border" which starts on page 1617 (it's in the WordprocessingML reference section under simple types). That shows a list of almost 200 legacy border patterns that you can apply to objects in a Word document. Tristan Davis, the Word representative on the Technical Committee, had to work on every single one of those and provide images so anyone else could reproduce them. He created almost 200 documents, took screenshots of each one, and then provided the description and image representation in the spec. This format is 100% compatible with the existing base of Microsoft Office documents, so nobody will need to worry about losing features, even if it's the "Maple Muffins" border style (page 1643) :-)
    1. Want some more depth? - Check out section 14.5 starting on page 135

I'm so excited right now, I'm really rushing just to get this blog post out. I can't wait to hear from people about what kinds of questions they have, or what they hope to do with the formats. We're going to have a lot of fun over the coming months (especially once Beta 2 is out the door and everyone can start to experiment with the files). More information to come, but that's it for now.

-Brian


There is a new article that will be in the June MSDN magazine that shows how to work with a Word document's properties programmatically. The author, Ken Getz, has actually been working on building a collection of code snippets that people can use to do different things with Open XML files. I believe most of the snippets he's building leverage the WinFX System.IO.Packaging APIs for cracking the ZIP and walking through the relationships (but of course you could use any ZIP library).

I'm not sure when the rest of the code snippets will be posted, but when they are, I'll definitely link to them. I'm sure you'll also be able to find them up on openxmldeveloper.org. I've actually played around with a number of them already and there are some pretty cool ones. One of them is about 10 or so lines of code and will remove any hidden text from a Word document. I modified it slightly so that you could also specify style names that you want to have removed. An example use of this would be if your company had certain styles to mark text that you didn't want to be shown externally. You could just run this small bit of code against any files as they are posted to any external facing sites.
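
For context, hidden text in a WordprocessingML document is just a run property, so that's what the snippet goes hunting for. A run marked as hidden looks roughly like this (a sketch based on my reading of the draft, so double check the reference section for the exact details):

<w:r>
  <w:rPr>
    <!-- marks this run as hidden text -->
    <w:vanish/>
  </w:rPr>
  <w:t>internal-only note</w:t>
</w:r>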

(man, I'm really loving this Word blogging tool too!)

-Brian


Surprisingly, I haven't seen much information out there discussing the performance impact of XML tag name lengths (i.e. using "<c>" instead of "<table-cell>"). My last post about some of the design goals behind SpreadsheetML raised some questions from folks about where the time is actually spent when loading an XML file. There are a ton of things that we do to improve the performance when opening and saving Open XML files. The move to using these formats as the default for Office 2007 meant we had to get really serious about how the formats were constructed so they could open efficiently. I'd be really interested to hear from other people who've worked on XML formats if they've had similar experiences.

For a lot of people who have worked with XML, the parsing of tags isn't really more than a percent or two of the overall load and save times. With office document formats, that's not always the case. Just to give you a bit of an idea about the scale these documents can get to, check out the article by George Ou about the performance of spreadsheet files: http://blogs.techrepublic.com.com/Ou/?p=120

In that article, the spreadsheet he uses is pretty big, but we have definitely seen much larger and more complex spreadsheets, so don't assume that it's a fringe case. If you save the spreadsheet from that article using the new Open XML format, you get the following:

  • File size compressed - 16 megabytes (131 megs uncompressed)
  • Number of XML elements - 7,042,830
  • Number of attributes - 9,639,043

So, as you can see, that's a lot of XML to parse over. As we looked at files like this, we saw that we absolutely needed to find different ways to optimize the formats to make them faster. Using shorter tag names was one of the first obvious ones.

Impact of long tag names vs. short tag names

In the profiles that we've looked at over the years, we've seen that simply using shorter tag names can significantly improve the performance depending on the type of file. For those of you really interested, you should do your own profiles and let me know what you find out. Remember that for an application like Excel, we're talking about the potential for millions of XML tags to represent a rich spreadsheet. Let's look at a couple issues now:

  1. Compression - Since we use compression (ZIP), there isn't much of a difference in file size, since a long tag name and a short tag name will pretty much compress to be the same size. This also means that time spent hitting the hard drive or transmitting over the wire will be about equal. When you do the actual compression though, if the tag names are longer, then there are just a lot more bits you need to read through to figure out the compression. These bits may be in memory, they may be on disk, but either way you need to deal with them at some point when compressing them. The same is the case for decompression: you will end up generating a lot more bits if the tag names are longer, even if the compressed bits are significantly smaller.
  2. Parsing - We of course use a SAX parser to parse our XML, and in most cases we also use a Trie lookup which is super fast (in other cases we use a hash). When using the hash, we of course still have to store the full tag name for a final comparison, because we don't have a known, bounded set of element values coming in. Not only do we allow for full extensibility, but we also have to allow for the fact that people might make a mistake when generating the files and we need to be able to catch those errors. For those familiar with hashing, you'll know that unless you are guaranteed a perfect hash, you also need to have a second straight string compare to ensure it was a proper match. So both for memory as well as processing time, tag length has a direct impact. The time taken for a Trie is directly proportional to the tag length. For a hash, it really depends on how you do your hash, and how you do your verification.
    • One drawback to the Trie is that it's more memory intensive. In most cases we make that tradeoff though because it's super fast. You can really see though how tag names have an impact all over the place. There are memory issues, parsing times, compression times, etc.
  3. Streamed decompression and parsing - As the XML part is being decompressed, we stream it to the parser. SAX is connected directly to the part IStream which then decompresses on demand. On the compression side, it's probably interesting to point out that we don't always compress each XML part as a whole. Instead we keep a single deflate stream and flush the compressed data when the "current" part being written out changes. For most parts we write them out as a whole, but there are some parts where that isn't the case.

I know that for a lot of people who've played around with XML, the parsing isn't really something that you would think of as being a major part of the file load times. This is not the case with office document formats, and especially spreadsheet documents.

With the latest SpreadsheetML design, we've seen that the XML parsing alone (not including our parsing of numbers, refs, and formulas) can often range from 10-40% of the entire file load. That's just the time it takes to read each tag and each attribute. This shouldn't be too surprising though, as the internal memory structures for a spreadsheet application should be fairly similar to the shapes that are used in the format design. A big piece is just reading the XML in and interpreting what the tags are.

Example

SpreadsheetML was designed so that for any tag or attribute that would appear frequently, we used super short tag names. We also established naming conventions for the abbreviations shared across all three formats (so that they become easier to interpret as you work with them). Elements that may only appear once in a file often have longer tag names, since their size doesn't have nearly the same impact. Right now, most of our frequently used tag names are no more than a couple characters in length. Let's imagine instead we decided to use longer more descriptive names so each tag was around 5 times larger (you can use the older SpreadsheetML or the OpenDocument format for examples of longer tag names):

Short tag example:

<row><c><v>1</v></c><c><v>2</v></c><c><v>3</v></c></row>
<row><c><v>4</v></c><c><v>5</v></c><c><v>6</v></c></row>

Long tag example:

<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="1"><text:p>1</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="2"><text:p>2</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="3"> <text:p>3</text:p></table:table-cell></table:table-row>
<table:table-row table:style-name="ro1"><table:table-cell office:value-type="float" office:value="4"><text:p>4</text:p></table:table-cell><table:table-cell office:value-type="float" office:value="5"><text:p>5</text:p></table:table-cell> <table:table-cell office:value-type="float" office:value="6"><text:p>6</text:p></table:table-cell></table:table-row>

For that example, the top one is using SpreadsheetML from the Ecma Office Open XML format. The second example is using the OpenDocument format. There is another optimization that SpreadsheetML does where you can optionally write out the column and row information on cells, but I removed that since it's actually another performance optimization that I'd like to discuss in a separate future post (and as I said it's optional).

Tag length impact

Let's imagine we have that file I mentioned earlier with 7 million elements and 10 million attributes. If on average each element and attribute name is about 2 characters long, then you have roughly 34 megabytes of data to parse (17 million names at 2 characters each), which is a ton, just in element and attribute names. If instead the average name length were more like 10 characters, then you're talking about 170 megabytes. That is a very significant difference.

This isn't rocket science of course. Most folks I've talked to agree that it's important to keep tag names short, especially in structures that are highly repetitive. In SpreadsheetML, you'll see that a lot of the element names actually are pretty long and descriptive, but only if they appear in a few places, and won't be much of a burden. Any element that can have a high frequency of occurrence is definitely kept to a minimum length.

Optimize based on your design goals

Remember, we're not talking about creating a format for hobbyists. This format is supposed to be used by everyone, and most of those folks aren't going to be happy with feature loss and performance degradation just so they can save out as XML (the average user doesn't care about XML). The original SpreadsheetML from Office XP was actually more like a hobbyist format, and as a result, it was really easy to develop against, but it was bloated and slow. I wish that we didn't have to worry so much about performance, but if you really expect these formats to be used by everyone, then you have to take the training wheels off. That's why the standardization in Ecma is so important though, so that we can ensure that everything is fully documented and all the information is there to allow you to develop against them.

I'll talk more about the other parts of the design that are optimized around performance in future posts. This was one that people had some questions on though so I just wanted to clarify and make sure there wasn't any more confusion. If you've looked at similar issues in your file format designs and found other interesting things like this, I'd love to hear about them! 

-Brian


There were a few interesting articles I saw this week that I wanted to point people at.

http://www.mercurynews.com/mld/mercurynews/news/opinion/14543676.htm

"… if government locks in winners and losers, manufacturers will focus on courting government, rather than innovating. The recent technology policy debate in Massachusetts offers a case in point. In September 2005, after lobbying by IBM, Sun Microsystems and others, the state's Information Technology Division (ITD) announced that all government agencies must convert to computer systems that use OpenDocument file formats, an alternative to the Microsoft Office formats."

I thought that was a really interesting read. There aren't a lot of articles looking at the issue from this side. Let's allow people to choose the formats they want. I'm not sure anyone is opposed to choice. I don't know about you guys, but (like I said in the blog title) I'm looking forward to more discussion around the technologies instead of the policies.

http://www.oreillynet.com/xml/blog/2006/05/debunking_the_odf_bunk.html

"…we will be setting the precedence for a future where instead of fighting for market share with features, we will instead be fighting with favors to politicians, lobbyists, and/or any other source of so called advantage we think we can possibly gain through the legal channels, spending all of our development resources on these same mentioned channels, instead of putting that money into the development of the products themselves.

Whether anyone on the ODF side is willing to admit it or not, this isn't about document formats."

I'm hoping that with communities popping up like OpenXMLDeveloper.org, we'll start to see more and more folks talking about the technologies themselves and the awesome stuff you can do with XML. I want the discussions to be more around building solutions and innovating on top of these formats. I want to hear from folks about what they want to do with the formats. What kind of solutions are people building? Let's start sharing these ideas.

http://blogs.techrepublic.com.com/Ou/?p=196

"But when I asked Sun's engineers point blank if they had verified my numbers, they stated that they do not dispute the numbers and immediately proceeded to explain why it was slower than Microsoft's format. The reason Sun explained was that Sun has to use the open standards OASIS compressed XML format while Microsoft used its own proprietary binary file format which was essentially a very efficient memory dump that didn't require a lot of CPU cycles to process (approximately 95 times more cycles based on my tests). But then I pointed out that even when I tested Microsoft Office with its own 2003 XML format plus the time it took to compress the data, it was still approximately 5 times faster than OpenOffice.org. Sun's engineers explained that this was due to the fact that ODF took longer to process than Microsoft's XML format. At this point in the conversation, they've managed to convince me that the OpenDocument format was 5 to 100 times less efficient."

From our point of view, the move to XML formats was actually a scary one since the old binary formats for Excel were so damned fast. That's why we had to look really closely at every aspect of the SpreadsheetML design to see where we could make the load times faster. As I've said before, most end users don't care about XML, they just care about their files working. It was up to us to make sure that we can give the developers out there XML without having a negative impact on the end user.

I'm not sure if it's good blog policy to post on a Friday afternoon. Have a great weekend everyone. I hope you're doing something fun, and aren't stuck reading my blog (at least until Monday)!

-Brian


Joe Friend has finally made it public that there will be built in blog functionality in Word 2007! I used it for authoring my last post, and I loved it. I wanted to mention it at the time, but didn't want to take away any of Joe's thunder. :-) I had to go through and clean up a couple things, but as Joe said, this feature is coming in a bit hot. Unfortunately, I'm still on Office 2003 here at home so I'm just writing this post in the web form (maybe I should have just waited until I got into work).

Beta 2 should be coming pretty soon, so here's yet another thing that I really think you all are going to love. The tool is really sweet; it takes advantage of content controls and the extensibility of the new user interface. It's also going to have hooks into some of the other applications which Chris alludes to...

-Brian 


It's been a while since I've talked in detail about the SpreadsheetML schema, and I apologize. I had a number of posts back in the summer which talked through Office XP's SpreadsheetML format that we built about 6 years ago, but obviously a lot has changed since then.

The new SpreadsheetML that is part of the Open XML formats coming with Office 2007 had to undergo serious work in order to make it ready to be the default format. As you all know, the majority of folks don't really care about what kind of format they are using, they just want it to work (remember that most end users have never even heard of XML). We wanted our formats to play a more vital role in business processes though, which is why we've slowly been progressing towards these new default XML formats. We want people to be able to easily build solutions on top of the formats, but at the same time, we don't want the average end user to feel much noticeable difference with the change (at least no negative differences).

That leads me to why we had to restructure SpreadsheetML from the original design. The two issues with the SpreadsheetML format from 6 years ago were that it wasn't full fidelity, and it wasn't optimized for performance/file size. The term "full fidelity" just means that everything that is in your file can be saved into the format without fear of it being modified or lost. The old SpreadsheetML format didn't support a number of features like images, charts, objects, etc. So we had to add all those additional things to the format.

The second part (performance) was a really important and challenging one. We wanted to move to an open format so that people could build solutions around our formats. Like many other applications out there, we chose a combination of ZIP and XML to achieve this. We had to write the XML though in such a way that it could be parsed extremely efficiently so that the file open and save experience wouldn't get significantly slower. There have been a number of articles related to this issue, where people have complained about performance in other applications that use XML as their format. Of course we had to keep this in mind with our design, and for those of you who have played around with it I'm sure you've noticed the difference.

While I'm not going to go into a full description of the SpreadsheetML format, I'd at least like to give you a brief introduction. A SpreadsheetML package is made up of a few different pieces. Let's look at a basic diagram of the pieces of a spreadsheet:
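
In plain text, the layout of the package looks roughly like this (a sketch of a typical spreadsheet package; the exact part names and folders can vary, and it's the relationships between parts that actually tie everything together):

/[Content_Types].xml
/_rels/.rels
/xl/workbook.xml
/xl/_rels/workbook.xml.rels
/xl/worksheets/sheet1.xml
/xl/worksheets/sheet2.xml
/xl/sharedStrings.xml
/xl/styles.xml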

The main parts I wanted to call out for today are:

  1. "sheet1" – This is the data for the worksheet. Each worksheet is stored as its own XML file within the ZIP package which means you can easily get at your data within a particular sheet without having to parse all the other sheets.
  2. "sharedStrings" – Any string (not number, just string) used in the sheet is actually stored in a separate location. There is a part called the "Shared string table" that stores all the strings used in the files. So, if you have a column called "states", and "Washington" appears 100 times in the spreadsheet, it will only need to be saved into the file once, and then just referenced.

I think an example might be best to help show what I'm talking about. Let's take a spreadsheet that looks like this:

  ID    Num    Resource
  1     543    F068BP106B.DWG
  2     248    F068BP106B.DWG

In the Open XML file, there would be an XML file that contained the strings used, that would look like this:

Shared String Table

<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
  <si>
    <t>ID</t>
  </si>
  <si>
    <t>Num</t>
  </si>
  <si>
    <t>Resource</t>
  </si>
  <si>
    <t>F068BP106B.DWG</t>
  </si>
</sst>

Then, in the main sheet, there would be cell values, and pointers into the string table wherever a string occurs:

Sheet1

<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/5/main">
  <sheetData>
    <row>
      <c t="s">
        <v>0</v>
      </c>
      <c t="s">
        <v>1</v>
      </c>
      <c t="s">
        <v>2</v>
      </c>
    </row>
    <row>
      <c>
        <v>1</v>
      </c>
      <c>
        <v>543</v>
      </c>
      <c t="s">
        <v>3</v>
      </c>
    </row>
    <row>
      <c>
        <v>2</v>
      </c>
      <c>
        <v>248</v>
      </c>
      <c t="s">
        <v>3</v>
      </c>
    </row>
  </sheetData>
</worksheet>

Notice that in the first row, each cell has the attribute t="s", which means it's a string. Then, the value is interpreted as an index into the string table, rather than an actual number value. In the 2nd and 3rd rows, the first two cells are interpreted as numbers, so they don't have the t="s" attribute, and the values are actual values.

This may seem a bit complex, but remember that while this format was designed for developers to be able to use, we couldn't take the performance hit that comes with making it completely intuitive. Believe me, as a developer, I would have loved to make the formats more verbose and straightforward, but that would have meant that everyone else opening the files would have to suffer for it. If the example above were a more complex set of data with a number of separate worksheets, each with a few thousand rows, you can imagine how quickly the savings of the string table and terse tag names would add up. I had a couple posts back in the summer talking about some other basic things we do to make sure that the formats are quick and efficient.

This tradeoff of who you design around, and how you weigh ease of use versus efficiency, is something folks have to look at every day when they design products. Whether it's an API, a user interface, or a file format, you need to decide which target user you are going to give more weight to when you make your design decisions. We had to give more weight to the end user, and instead require a bit more knowledge from the developer. That's why the Ecma documentation is so important. We need to make sure that the format is documented 100% and there are no barriers to interoperability. The great group of people we have on TC45 are really helping a lot here. As I said last week, the Novell guys have already built some working code that allows Gnumeric to open and save spreadsheet files in the Open XML format. I'm sure we'll see more and more implementations as we provide better documentation and get closer to a complete standard. It's really exciting! That's one of the great things we'll see more and more of up on the openxmldeveloper.org site.

-Brian
