Brian Jones: Open XML Formats

Interoperability by design

posted Wednesday, June 14, 2006 5:41 AM by BrianJones

Today we announced the formation of a new customer council focused on interoperability (how to make technologies work better together). I'm sure you've noticed over time that Microsoft has made a strong commitment to work towards better interoperability, and this is a big step forward in achieving that goal. I personally have focused on interoperability issues for about the past 6 years or so in working on extensible technologies like the object model and both the HTML and XML file formats. It's something I've always viewed as a key piece of our product design, and it's exciting to see more momentum building around this.

Pulling a quote from the press release:

"The council, hosted by Muglia, will meet twice a year in Redmond, Wash. The council will have direct contact with Microsoft executives and product teams so it can focus on interoperability issues that are of greatest importance to customers, including connectivity, application integration and data exchange. Council members will include chief information officers (CIOs), chief technology officers (CTOs) and architects from leading corporations and governments. Representatives from Société Générale, LexisNexis, Kohl's Department Stores, Denmark's Ministry of Finance, Spain's Generalitat de Catalunya and Centro Nacional de Inteligencia (CNI), and the states of Wisconsin and Delaware have joined as founding members."

As I said, we’ve been committed to the idea of "interoperability by design" for quite some time now, but the actual "interoperable by design" initiative was kicked off by Bill Gates last winter (Feb '05). We've heard numerous times from our customers that interoperability is a "key IT priority." When we design our products we look at how they will interact with a large selection of other products and with a wide variety of hardware. We have very large testing matrices in place to help ensure they work. This new customer council will help us in huge ways though as they will be able to identify some real life issues that we hadn't yet thought of (or prioritized high enough). As we identify new issues we can then look to solving those as well.

You see a lot of folks talk about interoperability, but often they just don't mean the same thing. From our perspective it's something we want to build directly into the products so that it just works. Another approach that companies have taken is to talk about it from the perspective of building specific "projects" where consulting is done (for a fee of course <g/>) to wire together a number of separate bits. I've also seen that often companies will talk about interoperability when it comes to areas that they aren't really competitive in, but want to be. This often leads them to push towards less functional and innovative technologies in an attempt to level the playing field. This is a far different approach from what we are talking about, and I want to make sure there isn't any confusion. There were a couple key talking points around this announcement that I really liked, and that is that we're producing "people-ready" and "value-returning" interop solutions and this new council will help us to be even more successful in doing that.

The work we're doing in Ecma is obviously a great example of the "interoperable by design" concept. We've taken a product where one of the key complaints was that the file format was not documented, and not only moved to use open technologies (ZIP and XML), but we're working with a bunch of other companies (including some competitors) to make it a fully documented international standard.
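Since the formats are built on ZIP and XML, ordinary general-purpose tools can open them. Here is a minimal sketch in Python using only the standard library. The part name `word/document.xml`, the helper names, and the tiny stand-in package built in memory are assumptions for illustration; the stand-in is not a complete or valid Open XML document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_stand_in_package() -> bytes:
    """Build a tiny ZIP that mimics the shape of a package (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Hypothetical part name and content; a real document has more parts.
        package.writestr(
            "word/document.xml",
            "<document><body><p>Hello</p></body></document>",
        )
    return buffer.getvalue()


def list_xml_parts(package_bytes: bytes) -> dict[str, str]:
    """Map each XML part name in the archive to the tag of its root element."""
    roots = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            if name.endswith(".xml"):
                roots[name] = ET.fromstring(package.read(name)).tag
    return roots


parts = list_xml_parts(build_stand_in_package())
print(parts)
```

The same two library calls (`zipfile.ZipFile` and an XML parser) should be all you need to start inspecting a real file, which is the point of moving to open technologies.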

If you want to learn more about interoperability at Microsoft, you should check out the interoperability site: http://www.microsoft.com/interoperability

-Brian

Filed Under: Office 2007

Learn more about Word 2007's support for separating data from presentation

posted Monday, June 12, 2006 2:51 PM by BrianJones

If you're heading out to TechEd this week like I am, you should definitely plan on attending Tristan Davis' talk on Thursday afternoon that covers the new functionality in Word 2007 for custom XML solutions.

This talk goes into great detail on the true power of XML in Office applications. XML file formats are obviously important, but the really exciting stuff isn't what you can do with the wordprocessing schemas but what you can do with the support for your own schemas. People want their Office documents to seamlessly interoperate with business processes and solutions, and custom schema support is the way you can achieve it. With schemas like Open XML and ODF, you are generating wordprocessing documents, spreadsheets, and presentations. With custom defined schema support, you can take it to the next level and instead create invoices, trip reports, product specifications, research reports, pitchbooks, reviews, articles, resumes, applications, etc. There are no limits to the types of documents you can create, as you have the ability to define the schema.

I blogged earlier this year on both the importance of custom defined schemas as well as the new content controls in Word 2007. There is also a new article up on openxmldeveloper that shows some more examples of how to drop your own XML into a .docx file and map the values into the surface of the document.

Here is the description of Tristan's talk where he'll show a number of examples as well as dig into the ways you can leverage Word 2007 to build powerful solutions:

OFC335 Microsoft Office Word 2007 XML Programmability: True Data/View Separation and Rich Eventing for Custom XML

Day/Time: Thursday, June 15 4:30 PM - 5:45 PM Room: 257 AB
Speaker(s): Tristan Davis

Microsoft Office Word 2007 brings a data model that allows data and presentation to be managed separately, extending the structured document concept introduced in Word 2003. This includes significant investments in the support for custom XML data in the new Office Open XML file formats, as well as rich access to that data from within the application. Developers can work directly against the XML data via XML mappings to the Word document, or via embedded InfoPath solutions in the Document Information Panel. In this session, we introduce these new capabilities, then dive into the functionality of the Office XML data store (which provides custom XML storage), and how it can be leveraged to build solutions that will strongly tie Word documents to your business processes.

Track(s): Office System
Session Type(s): Breakout Session
Session Level(s): 300

Hope to see you all there.

-Brian

Filed Under: Word, Office 2007, Conferences

TechEd Boston

posted Friday, June 09, 2006 7:56 AM by BrianJones

Who else is going to be heading out to Boston next week for Tech Ed 2006? I'll be out there for a couple days in the middle of the week. I'm presenting Wednesday from 5:30-6:45, so be sure to swing by and say hi.

Here's the information on my session:

OFC324 Microsoft Office Open XML Formats

Day/Time: Wednesday, June 14 5:30 PM - 6:45 PM Room: 253 ABC

Speaker(s): Brian Jones

Learn about the huge change that will affect the role Office documents can now play in business processes and solutions. Previously the binary formats had meant Office documents were treated more like a "black box," but that is no longer the case, as these open formats allow documents to serve as a first class source of data as they travel through workflow and other business. Document content can now directly integrate with systems new and old. Generation of documents based on business data for up-to-date and accurate rich content is now possible throughout your own solutions. This session delves into schemas, solution code, and numerous examples.

Track(s): Office System

Session Type(s): Breakout Session

Session Level(s): 300

I hope I'll get a chance to see you all out there. It should be a lot of fun!

-Brian

Filed Under: Office 2007, Conferences

Word XHTML - Mapping styles to semantics

posted Thursday, June 08, 2006 4:42 AM by BrianJones

This is the third post by Zeyad Rajabi, who owns the XHTML output from Word's new blogging feature. In earlier posts, Zeyad gave a general overview of the XHTML as well as a more detailed post on XHTML compliance. Today Zeyad is discussing the ways in which styles have been directly tied to specific XHTML tags.

Today I wanted to talk a bit about the template that we use for the Word 2007 blogging feature. Word has always concentrated on the presentation of documents and making it easy for people to quickly create a great looking document. The area that we haven't focused on as much though is allowing people to better specify the semantic meaning of their content. We've been slowly moving in that direction with the custom XML support in Word 2003 and the content controls and XML mapping in Word 2007. We actually leverage a content control to allow you to specify the blog's title.

One of the oldest ways folks would specify semantic meaning in a Word document though was by using styles, and we've done work in Word 2007 to make styles much more convenient for the average end user. We've created a number of Word styles that we map directly to XHTML tags of semantic meaning (such as <blockquote>). We then let the browser and blog sites determine how to render these tags (based on stylesheets, etc.).

In Word 2007 one of our investments was giving our users easy access to applying styles via "quick styles". In our blog template we provide a list of styles that can be applied to the contents of the document as you can see in the screen shot below:

[Screenshot: Styles]

These styles are all significant in that we can map them directly to XHTML tags (rather than simply to formatting properties). Below is a table listing all the styles provided by our blog template and their XHTML equivalent.

Style	HTML
Heading	h1, h2, h3, h4, h5, h6
Normal	p
Quote	blockquote
Code	<pre><code>… </code></pre>*
Strong	strong
Emphasis	em

*The code style is being added post Beta 2.

What do you guys think of having a style called "Code" that actually nests pre and code together? Word differs from the web in that we automatically preserve whitespace, so in order for us to correctly output XHTML for the code style we also need to output the preformatted style.
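To make the mapping concrete, here is a minimal sketch in Python of how a style name could be turned into XHTML tags, including the nested pre/code pair for the Code style. The style names and tags come from the table above; the `render` function, its signature, and the dictionary are hypothetical and are not Word's actual implementation.

```python
from html import escape

# Style-to-tag pairs taken from the table above. "Code" lists two tags
# because the output nests <code> inside <pre> to preserve whitespace.
STYLE_TO_TAGS = {
    "Normal": ["p"],
    "Quote": ["blockquote"],
    "Code": ["pre", "code"],
    "Strong": ["strong"],
    "Emphasis": ["em"],
}


def render(style: str, text: str, heading_level: int = 1) -> str:
    """Wrap text in the tag(s) mapped to a style (hypothetical helper)."""
    if style == "Heading":
        if not 1 <= heading_level <= 6:
            raise ValueError("heading_level must be between 1 and 6")
        tags = [f"h{heading_level}"]
    else:
        tags = STYLE_TO_TAGS[style]
    opening = "".join(f"<{tag}>" for tag in tags)
    closing = "".join(f"</{tag}>" for tag in reversed(tags))
    # Escape markup characters so the text is safe inside the element.
    return f"{opening}{escape(text)}{closing}"


print(render("Code", "x  <  y"))
print(render("Heading", "Title", heading_level=2))
```

Note that no formatting properties appear anywhere in the output; how each tag looks is left to the browser and the blog's stylesheet.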

One interesting discussion that came up in some of my previous posts was whether it was better to use <b> and <i> or <strong> and <em> when people applied bold or italic formatting to their text. Having the <b> and <i> tags in the HTML means that the look of the document is unlikely to change regardless of style sheet (it's more likely that <strong> would have CSS props than <b>). On the other hand, <strong> and <em> provide much more flexibility in that they really only imply semantics and not display values. While <strong> and <em> have a default presentation, it is often overwritten by the CSS of the page or the rendering engine.

Some people were saying that there are occasions where <b> and <i> better capture what the user intended. While I do agree that may be the case at times, I also believe our UI encourages the use of bold and italic when the user often was just trying to convey the semantics. In most cases, I believe they would have specified strong or emphasis if it were as easy and obvious as bold and italic (there just hasn't been a benefit to doing so in the past). Since we are going the route of XHTML compliance and we are concentrating on structure rather than presentation, we opted to always output strong/em rather than a more confusing mixture of the two (bold/italic and strong/em).

Custom Styles

One area that we are looking at investing in is giving folks the ability to add custom styles to the blogging template that would then be output as a simple style tag. So unlike the above examples where the style is mapped to a specific XHTML tag, we would simply output a generic element whose class name matches the style. So, if a user adds the style "foo" to their blogging template, then when that style is applied we would output:

…..

We would not output the formatting information for the style because in most cases the CSS would be stripped upon publishing to a blog provider. Instead, with this approach, you could rely on the CSS of the host site of the blog to specify the presentation information for those custom styles.
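As a rough sketch of this idea in Python: a custom style becomes a class name on a generic element, and no inline formatting is written. The choice of `span` as the default element and the function name are assumptions for illustration only; the post does not confirm which element would be used.

```python
from html import escape


def render_custom_style(style_name: str, text: str, tag: str = "span") -> str:
    """Emit an element whose class matches the custom style name.

    Hypothetical helper: the default tag is an assumption. No style
    attribute is emitted, so presentation is left to the host site's CSS.
    """
    class_attr = escape(style_name, quote=True)
    return f'<{tag} class="{class_attr}">{escape(text)}</{tag}>'


print(render_custom_style("foo", "hello"))
```

The blog host's stylesheet could then define a rule for the `foo` class to control how that content is presented.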

Comments are welcome

Any comments or questions are welcome. Also let me know if there are any other similar structures you guys are interested in talking about next (ordered and unordered lists, definition lists)?

Filed Under: Office 2007, Word HTML

Thoughts on Open XML in ISO

posted Monday, June 05, 2006 11:30 AM by BrianJones

As we move forward with the standardization of the Office Open XML formats, it's interesting to look at the motivations that brought us to this point, but also to think about what is still to come. We've wanted to provide folks with easier ways to work with our formats for years now, mainly because it significantly increases the value of Office documents when they are fully documented. An open format can integrate with business processes, databases, and workflows in a much simpler and more powerful way (for more on why we made the move to open formats, read here and here). That's why we've worked so hard over the past 3 or 4 releases to invest in other formats like RTF, HTML, and XML. These new Open XML formats, which will be the default formats for Office 2007 (as well as work in Office 2000, XP and 2003), are the result of all that work. If you've read my blog at all you know that it's been a serious evolution and a lot of work, and I'm really excited about the potential. We already have hundreds of thousands of external developers building solutions on top of the XML formats from Office 2003, which weren't even the default formats, so you can imagine how huge this move to new default XML formats is.

One thing I've heard from a number of folks though is that they are wondering what the next steps will be for the formats once they are standardized. Well, ultimately that is up to the organization that has taken over the ownership and maintenance of the formats. We're currently standardizing the formats at Ecma International, which would mean that Ecma (which consists of representatives from a large number of companies in the industry) would own the formats as well as determine how the formats evolve. There has also been talk though of taking the formats to ISO once they have been approved by Ecma, which would mean that if ISO chooses to adopt the Open XML formats the stewardship of the formats would be theirs. We've had a number of governments indicate that they would like the formats to be given to ISO, and it's likely that after the Ecma approval that will be the next step.

A number of people have asked if the approval of ODF by ISO has an impact on the standardization of Open XML. I don't believe so, given that ODF and Open XML have two very different goals in mind. Open XML was designed around compatibility with the existing base of Microsoft Office documents. There are literally billions of documents that exist today in those binary formats, and the goal of Open XML is to allow for a seamless migration from those old formats into the new XML formats. This is a huge undertaking, and it's the reason that the spec is so large. I think that given the obvious need for an open XML format that achieves these goals, and the fact that ODF was not designed for that purpose, it's clear that there isn't a direct conflict between the two formats and there is no reason ISO wouldn't want to approve and steward both formats. Rick Jelliffe, who has a wealth of experience with ISO and standards, has two posts that clearly call this out:

Gartner, Groklaw 0. Rick 1

"ODF, for example, will change in no substantive way in its ISO adoption. National body comments will be added to requests or requirements for future versions. The Ecma Open XML people, so far, are being far more concilliatory in this regard: they know that a Microsoft technology doesn’t have the presumption of innocence that a Sun format does, in the minds of many.

If Microsoft/Ecma/et al manage to demonstrate to the ISO member voters that Open XML had even a first round of openness at Ecma, that it has some different use from ODF, if it supports SC34 specs like RELAX NG, and is scrupulous in its partitioning of Windows-specific hooks to another layer or namespace, I don’t see any national body rejecting Open XML, frankly. Microsoft and Ecma still have work to do in this regard, but it is just the standard kind of technical-level education/discussion/wordsmithing/re-alignment that any specification should have. "

Open XML at ISO sideshow

"They are generating lots of media attention, FUD and lobbeying; but it ODF and Open XML both represent a victory for universal, ubiquitous, standard generalized markup, which is what SC 34 is in large part about. I see Gartner has estimated a less than 70% chance of ISO ratifying two XML office formats. What rubbish. I’ll know more next week.

Ultimately, it is not WG1 or SC34 that makes the decision. It is the national votes of each of the voting members of ISO: the national standards organizations like Standards Australia, ANSI, and so on. While local committees may feel that Microsoft has been conspicuous in their absense, so have the other big companies in recent years: the standards participation focus shifted to W3C and OASIS. But these committees are not stacked with anti-Microsoft (or anti-Sun) people, but with organizations who need good interchange and also need an XML retrieval for legacy documents in proprietary formats (.DOC, etc.). So I find it very difficult to agree with Gartner’s 70%; I’d put it the other way, with a 70% likelihood of success, at least.

ISO is not an anti-monopoly court. It is there to help people who want to agree on technology, providing procedures, forums and a publishing house.

But the issue of having two office standards is a fair one. I think all Microsoft needs to do is to distinguish Open XML from ODF adequately and prove that it has a credible alternative constituency who would not be served well by ODF. That there is overlap is immaterial if there is a significant difference."

For anyone who has played around with the XML formats, I'm sure you've seen that we really took seriously our goal of minimal user impact in the move to default XML formats. This included things like performance (which I've already briefly touched on when I talked about spreadsheetML, tag lengths, and shared formulas), as well as full compatibility with the existing base of Office documents. I've just recently started to show some basic examples of where ODF just doesn't come through in terms of compatibility, such as with formulas, numbering formats, and highlighting (and these were just the first three things I came across... I have a growing list that I'll talk about over the coming months).

-Brian

Filed Under: Office 2007

Follow-up on PDF legal issues

posted Saturday, June 03, 2006 9:29 AM by BrianJones

There have been a ton of really great comments and questions today in relation to the news that we are going to have to pull our PDF and XPS publish support out of Office 2007. We will still offer the PDF and XPS publish functionality as a free download, but due to pressure from Adobe we are not able to ship it in the box. This is just an unfortunate added pain for our customers and doesn't really benefit anyone.

There were a few areas that I've seen a bit of uncertainty around lately, and it was mainly from folks wondering why Adobe would do this. While I can't say what the actual motivation was, I think I can help to clear up some of the speculation I've seen out there.

ISO 19005-1 compliant PDF/A

The first thing I've seen some folks suggesting is that the PDF output from Microsoft was somehow not following the PDF spec and that Adobe had to step in to stop that. I can assure you that is definitely not the case, and if you have any reason to believe that our PDF output was flawed, we would love to hear about it. If you have any feedback, you can go to Jeff Bell's blog, which discussed our PDF support. He hasn't been too active lately, but he's definitely still reading the comments. Cyndy also had a couple posts earlier on that gave more details on Office's PDF support (here and here).

To be clear, we worked really hard to follow the ISO standard for PDF. If you use the Beta 2 version of Office 2007, you'll see the following dialog when you choose to publish as PDF:

Notice that there are a number of options for how you publish your PDF. One of the key ones is to use the ISO 19005-1 standard for PDF:

You'll see that we really are trying to comply with the spec, and wouldn't have anything to gain by doing otherwise. Remember we are only a producer of this stuff (not a consumer), and doing anything non-compliant would just mean that our output would be flawed and not look right. That would of course undermine all the work we've done to build this support in the first place... we want people to use it.

Pluggable architecture

The second issue I've seen folks raise is that they thought we might have been blocking other people out from building their own solutions/formats into the product. That's also not the case. Check out this MSDN article that clearly explains how anyone could come along and add their own functionality into the publish feature: http://msdn2.microsoft.com/en-us/ms406051(office.12).aspx

You can see that we definitely are positioning Office as a platform for anyone to build solutions on top of. That's why we use the Open XML format as our new default format and it's why we focus on a rich object model.

Promoting XPS?

The third issue I've heard is that people think this may be a sneaky way for us to promote the XPS format. That's also not the case, as you'll see that we are removing the XPS support in the same way we are removing the PDF support. We actually separated the XPS support out because we wanted to make sure we weren't giving XPS an unfair advantage over PDF.

Closing...

Please understand that this really is a pretty straightforward issue. If we could include it in the product we would, but unfortunately we can't so we had to go with the next best thing (a free download). It would be great if this could all get worked out, but from looking at the articles, our folks have been in discussion with the Adobe folks for a number of months now, and there hasn't been any progress. It's really a shame.

-Brian

Filed Under: Office 2007, PDF

Legal issues around PDF support

posted Friday, June 02, 2006 7:23 AM by BrianJones

About 8 months ago we announced to our MVPs that we would provide PDF publish support natively in the 2007 Office system. We made the move due to overwhelming customer demand for PDF support, and it was received really well. The blog post I made around the announcement was probably one of my most widely read posts of the year.

Unfortunately, it doesn't look like we're going to be able to do the right thing for the customer now. There was a news article in the WSJ today (and now on CNet) indicating that Adobe didn't like that we provided the save to pdf functionality directly in the box, and so they’ve been pushing us to take it out. I'm still trying to figure that one out given that PDF is usually viewed as an open standard and there are other office suites out there that already support PDF output. I don't see us providing functionality that's any different from what others are doing.

It looks like Adobe wanted us to charge our customers extra for the Save as PDF capability, which we just aren't willing to do (especially given that other companies already offer it for free). In order to work around this, it looks like we're going to offer it as a free download instead. At least that way it's still free for Office users, but unfortunately now there is an added hassle in that anyone that wants the functionality is going to have to download it separately.

This really is one of those cases where you just have to shake your head. Adobe got a lot of goodwill with customers, particularly in government circles, for making PDF available as an open standard. It’s amazing that they would go back on the openness pledge. Unfortunately, the really big losers here are the customers who now have one extra hassle when they deploy Office.

This is also surprising to me given that certain governments have viewed PDF as being more open than Open XML, yet Open XML is now proceeding through Ecma and there is a clear commitment from Microsoft that it will not sue anyone for using the formats. Anyone can build support for our formats, and we've already seen people starting to do this today (a couple weeks ago I actually referred to a demo we saw from the Novell folks where they had a prototype of a product using the Office Open XML formats). I don't think this was the intention, but Adobe seems to be saying that PDF is actually not open (or that it is open for some, but not for others). I'm not sure that any of those government policy makers could justify this outcome.

Hopefully Adobe will decide that this is a mistake and that they probably shouldn't try to sue people for using an open file format. If you're like me and think this is just a bad thing all around, you should let them know.

-Brian

Filed Under: Office 2007, PDF

Highlighting in a document

posted Thursday, June 01, 2006 7:00 AM by BrianJones

I've had a lot of folks ask me to provide more information on what features are missing from ODF and why it was that we decided to create our own XML format (Open XML). I didn't want to get too involved in pulling together a full detailed list, but it's probably worthwhile pointing things out every once in a while. Most of you know that ODF wasn't even around when we first started working on our XML formats, so that's really one of the big reasons. Another reason is that we needed to make sure that we created an XML format that all of our customers could (and would) use. We want our customers to move all their existing documents into this new format and we need them to be willing to use it as the default format. ODF just wouldn't have allowed us to achieve that (both because of a lack of functionality as well as different optimizations that sacrifice things like performance).

An area I just came across today that really surprised me was highlighting. I'm sure most folks are familiar with highlighting in a Word document. You can use highlighting to call attention to different areas in a document either for yourself or to point things out to others. The key about highlighting is that it does not affect any other formatting. Character shading (aka background-color in ODF) for instance will still be preserved when you highlight some text. I've seen some implementations out there that try to use shading as a substitute for highlighting, but that doesn't really work because people may also want to apply shading in addition to highlighting. For example, you may have a range of text shaded with light gray (i.e. the background-color is light gray), and then you want to highlight some of the text in that range. Then, once folks have reviewed the document, you want to remove the highlighting without removing the gray shading. In the ODF spec I saw support for shading on text, but not for highlighting, which we view as a different thing (I only saw mention of highlighting on tables).
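The light gray example can be sketched as a small data model in Python. This is only an illustration of why the two properties need to be stored separately; the class and field names are hypothetical and do not reflect the actual schema of either format.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RunFormatting:
    """Formatting for a run of text (hypothetical model, not a real schema)."""
    shading: Optional[str] = None    # character background color
    highlight: Optional[str] = None  # reviewer highlight, stored separately


def add_highlight(fmt: RunFormatting, color: str) -> RunFormatting:
    # Only the highlight changes; shading is carried over untouched.
    return replace(fmt, highlight=color)


def remove_highlight(fmt: RunFormatting) -> RunFormatting:
    return replace(fmt, highlight=None)


shaded = RunFormatting(shading="light gray")
reviewed = add_highlight(shaded, "yellow")
cleared = remove_highlight(reviewed)
print(cleared)
```

If highlighting were stored in the shading field instead, `add_highlight` would overwrite the light gray, and there would be nothing left to restore once the review was done.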

I came across this the other day while I was looking through the ODF spec and comparing it to the Ecma draft trying to get a better handle as to why the ODF spec was so much lighter (700 pages compared to 4000). I wanted to see if there were things we could do to reduce the number of pages in the Open XML spec without losing any of the necessary information. It looks like while there are some things that can be done for minor size reductions, we just have a lot more functionality and there is no way we could get it anywhere close to that small while still fully covering wordprocessingML, spreadsheetML, and presentationML. There are three reasons that we have so much more content. The first is that we are just representing a much richer set of features (since we have to XMLize all the existing Microsoft Office binary documents) so as a result there is just a lot more to document. The second reason is that the ODF spec points off to other specs for certain things to provide more details. The third reason is that the Ecma Open XML spec is just a lot more detailed as to how things work. The WordprocessingML sections are the furthest along in the latest draft, and if you read through the paragraphs and rich formatting section for instance (Section 19), you'll see what I'm talking about. The ODF spec on the other hand is very light and vague on a number of issues (like the numbering format issue I pointed out earlier).

-Brian

Filed Under: Office 2007

Word XHTML - Compliance and Styles

posted Tuesday, May 30, 2006 11:07 AM by BrianJones

This is the 2nd post in a series by Zeyad Rajabi, who is a program manager working on Word's XHTML output used in Word's new blogging feature.

My first blog gave a brief introduction on our XHTML output for the blogging feature in Word 2007. This post will outline details on the styles we output.

Goals

In my last post I said we wanted to be XHTML compliant by the time we ship this blogging feature. Today I want to be a little clearer as to what I mean: strict vs. transitional.

Working on Word I have come to understand Word’s HTML and CSS capabilities. Word only supports a subset of the standard HTML 4.0 specification and similarly only a subset of the standard CSS 1.0 specification. Yes, you read correctly, CSS 1.0. For the most part, the feature set we offer within the blogging tool allows us to output CSS properties that Word supports and can render correctly. However, there are a few examples where we are unable to output CSS properties (in order to be XHTML Strict compliant) because Word would not be able to read them in. All unknown HTML and CSS in Word are basically ignored, and it was a goal that the blog posts could be edited by Word after they are published.

What does that mean for our output?

At a minimum, our goal will be to always validate as XHTML 1.0 Transitional compliant code. For a basic blog we will validate as XHTML 1.0 Strict compliant code. For those blogs that use features where we cannot output Word supported CSS, our aim is to be XHTML 1.0 Transitional compliant.

Word can certainly output any HTML or CSS, but the issue then is around roundtripping, which is the ability to generate HTML or CSS that can be read back in correctly. An obvious question would be to ask why Word can't just add the functionality to read those additional properties back in correctly. This would be great, but we are on a limited budget, and that would have meant taking away other features that we have prioritized higher. Because of this, there is a fine balancing act that we must perform: roundtripping vs. XHTML output.

XHTML Style Output

Feature	XHTML CSS Property	HTML Element
Font	color	span
Font	font-family	span
Font	font-size	span
Font	text-decoration:line-through	span
Font	text-decoration:underline*	span
Block	text-align*	p
Block	text-indent*	p
Background	background-color	span
Box	margin-left*	p
Table Padding	padding-top	td
Table Padding	padding-left	td
Table Padding	padding-bottom	td
Table Padding	padding-right	td
Table Borders	border-collapse:collapse	table
Table Borders	border-top	td
Table Borders	border-left	td
Table Borders	border-bottom	td
Table Borders	border-right	td
Position	width	col

CSS properties marked with * will be output post Beta 2.

An interesting property that is missing is float. Unfortunately, Word does not understand that CSS property. Instead, we will use the HTML attribute align, which will make us XHTML Transitional compliant for the blogs with that type of content. We could output float, but if the post is ever read back into Word that property would be ignored, and the image would no longer float.
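The Strict vs. Transitional decision can be sketched in Python as follows. The two DOCTYPE declarations are the standard W3C ones for XHTML 1.0; the feature names and the `choose_doctype` function are hypothetical and only illustrate the rule described in this post, not how Word actually decides.

```python
STRICT_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)
TRANSITIONAL_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)

# Hypothetical feature names. Anything here needs markup (such as the
# align attribute on an image) that only validates as Transitional.
TRANSITIONAL_ONLY_FEATURES = {"image-align"}


def choose_doctype(features_used: set[str]) -> str:
    """Pick the strictest DOCTYPE the post's content can validate against."""
    if features_used & TRANSITIONAL_ONLY_FEATURES:
        return TRANSITIONAL_DOCTYPE
    return STRICT_DOCTYPE


print(choose_doctype({"bold", "heading"}))
print(choose_doctype({"bold", "image-align"}))
```

A basic post ends up Strict, and only posts that use an aligned image fall back to Transitional, which matches the minimum goal stated earlier.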

Another interesting thing to point out is the styles we output for tables. As you can tell our HTML output for tables in Beta 2 is quite bloated. This table will certainly be updated as we get closer to release. I will post the complete spec of Word's XHTML support at a later date.

Suggestions are welcome

Anything missing?

11 Comments
Filed Under: Word HTML

Spreadsheet performance - Shared Formulas

posted Monday, May 29, 2006 4:05 AM by BrianJones

I wanted to follow up on the thread I started a couple weeks ago discussing the design goals behind SpreadsheetML. There's a whole host of things we've done to make sure that the move to XML formats is a huge benefit to developers, without it actually having a significant negative impact on our end users. Moving our formats into an open XML format significantly increases the relevance and value of Office files because they can have more interactive roles with business processes and systems. If the files aren't open, then they can't fit into as many scenarios and we lessen their value. That was the whole reason we started the move to XML formats so many years ago, because we wanted to significantly increase the value of Office documents.

The big key here though is to make sure that people will actually use the formats. If our users decide to stick with the old binary formats, then we lose out on that opportunity. So it's really important to make sure that the average end user (who doesn't really care or put any thought into file formats) doesn't see any significant losses from moving into the new XML formats.

A big area for Spreadsheets is performance. It was really scary to move from the Excel binary formats that were so damned fast into an XML format that we knew couldn't match up. There was a lot of work though analyzing every aspect of file load times to make sure that we could keep the performance drop to an absolute minimum, and we actually have really done a great job. We are absolutely trying to design for the developer, but the priority is of course given to the end user experience when it comes down to difficult design decisions.

I already talked about a few things we've done like keeping tag and attribute lengths small. I also mentioned the shared string table and I'll go into more detail on the benefits of that later. Today I want to discuss some optimizations we've done around formulas.

Shared Formulas

Most folks who've done any type of spreadsheet work are familiar with formulas. More relevant to this post though, you are probably also familiar with using a common formula in a column to make a similar calculation for each row. For example, imagine the following table:

Product ID	Price	Quantity	Total
1	5.45	9	49.05
2	3.99	15	59.85

Most likely, that fourth column is a formula so that if you could look at the formulas, the table would look like this:

Product ID	Price	Quantity	Total
1	5.45	9	=B2*C2
2	3.99	15	=B3*C3

Now, for this case, it probably doesn't look like there are a lot of optimizations we can do around the formula storage. The XML for this table would look something like this (in shorthand):

<table>
<row>
 <c><v>Product ID</v></c>
 <c><v>Price</v></c>
 <c><v>Quantity</v></c>
 <c><v>Total</v></c>
</row>
<row>
 <c><v>1</v></c>
 <c><v>5.45</v></c>
 <c><v>9</v></c>
 <c>
 <f>=B2*C2</f>
 <v>49.05</v>
 </c>
</row>
<row>
 <c><v>2</v></c>
 <c><v>3.99</v></c>
 <c><v>15</v></c>
 <c>
 <f>=B3*C3</f>
 <v>59.85</v>
 </c>
</row>
</table>

The above XML representation would be pretty good if the table were this simple. Imagine though if this were actually more of a real world spreadsheet. Let's say this is tracking a large number of orders, so maybe you have more like 10,000 rows instead of just 2. Well that would mean that you have to parse that formula and figure it out 10,000 times. This would be a huge performance hit, especially if the formula was more complex.

You've probably noticed in Excel that if you type a formula into the first row and then copy it down through all the other rows, Excel can automatically adjust that formula in each row so that it's making the right calculation (i.e. you put =B2*C2 in row 2 and when you copy that into row 3 it says =B3*C3). Well, why shouldn't we do the same thing in the file format? No reason to write out the formula 10,000 times and have to parse it 10,000 times. A complex formula can take a while to parse, so we need to optimize around as little parsing as possible. Here is what the XML would actually look like using this approach:

<table>
<row>
 <c><v>Product ID</v></c>
 <c><v>Price</v></c>
 <c><v>Quantity</v></c>
 <c><v>Total</v></c>
</row>
<row>
 <c><v>1</v></c>
 <c><v>5.45</v></c>
 <c><v>9</v></c>
 <c>
 <f t="shared" si="0" ref="D2:D3">=B2*C2</f>
 <v>49.05</v>
 </c>
</row>
<row>
 <c><v>2</v></c>
 <c><v>3.99</v></c>
 <c><v>15</v></c>
 <c>
 <f t="shared" si="0"/>
 <v>59.85</v>
 </c>
</row>
</table>

Notice that now the formulas say they are "shared" and they have an id of "0" so that we know they match up. The first instance of the formula stores the actual formula value, as well as specifies the range that it applies to so that as the rest of the spreadsheet is parsed, you know that you can ignore those other cells and just apply the shared formula.
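To make the reading side concrete, here is a rough Python sketch of how a consumer might expand shared formulas. It works on the simplified shorthand used in this post, not on the real SpreadsheetML schema. Several things are assumptions for the sake of the example: the function names are made up, rows and columns are numbered by position (header row is row 1), the shared range is taken to be the cells that hold the formula (D2:D3 for the Total column), and only relative row references are adjusted.

```python
import re
import xml.etree.ElementTree as ET

# Shorthand from the post (not real SpreadsheetML); ref is assumed to be D2:D3.
SHORTHAND = """
<table>
<row><c><v>Product ID</v></c><c><v>Price</v></c><c><v>Quantity</v></c><c><v>Total</v></c></row>
<row><c><v>1</v></c><c><v>5.45</v></c><c><v>9</v></c>
 <c><f t="shared" si="0" ref="D2:D3">=B2*C2</f><v>49.05</v></c></row>
<row><c><v>2</v></c><c><v>3.99</v></c><c><v>15</v></c>
 <c><f t="shared" si="0"/><v>59.85</v></c></row>
</table>
"""

# Relative A1-style references such as B2 or AA10. This naive pattern ignores
# $ anchors and sheet names and would also match function names like LOG10.
CELL_REF = re.compile(r"\b([A-Z]+)([0-9]+)\b")


def shift_rows(formula, row_delta):
    """Shift every relative row number in the formula by row_delta."""
    return CELL_REF.sub(
        lambda m: f"{m.group(1)}{int(m.group(2)) + row_delta}", formula
    )


def expand_shared_formulas(xml_text):
    """Return {(row, col): formula} with shared formulas written out in full."""
    masters = {}   # si -> (row of the master cell, formula text)
    formulas = {}
    root = ET.fromstring(xml_text.strip())
    for row_no, row in enumerate(root.findall("row"), start=1):
        for col_no, cell in enumerate(row.findall("c"), start=1):
            f = cell.find("f")
            if f is None:
                continue
            if f.get("t") == "shared" and not f.text:
                # Dependent cell: derive its formula from the master cell.
                master_row, text = masters[f.get("si")]
                formulas[(row_no, col_no)] = shift_rows(text, row_no - master_row)
            else:
                if f.get("t") == "shared":
                    masters[f.get("si")] = (row_no, f.text)
                formulas[(row_no, col_no)] = f.text
    return formulas
```

Running `expand_shared_formulas(SHORTHAND)` gives `=B2*C2` for the cell in row 2 and `=B3*C3` for the cell in row 3. A real implementation would presumably parse the master formula once into an internal form and reuse it, rather than rewriting strings as this sketch does.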

I admit that it makes development a bit more difficult, but as I said before, we kind of had to take the training wheels off and actually think more about the end users as well. There are hundreds of millions of users, and we need to keep them in mind. We definitely want these formats to be accessible to developers (otherwise we should have just stuck with the old binaries), and that's why we are taking the Ecma standardization so seriously. We need to make sure that we fully document these formats so that anyone can build solutions around them. We are also looking to folks in the openxmldeveloper.org community to help out here by building tools that work on top of the formats. This way we get the best of both worlds. We get a very fast format that's also easy to develop against!

-Brian

23 Comments
Filed Under: Office 2007, Excel

Numbering formats in ODF

posted Friday, May 26, 2006 4:24 AM by BrianJones

I was wondering if someone more familiar with the ODF spec could help me out. I had made the following reply yesterday to Alex's assertion that ODF was as feature rich as Open XML and I want to make sure that I'm not misstating things:

I think you might want to dig a bit deeper into the formats. ODF does build on existing industry standards, but at times they are partial implementations, and it still leaves out a lot. For instance, Open XML actually uses more of the Dublin Core metadata schema than ODF does.

Another easy example would be to look at the different types of numbering for a wordprocessing file. In Microsoft Office you can say that the numbered list should be "first", "second" and "third" instead of "1.", "2." and "3.". ODF doesn't support that.

That's just the beginning though. If you are from another country like Japan or China, there is absolutely *zero* mention of how your numbering types are defined. The spec only specifies:

- Numeric: 1, 2, 3, ...
- Alphabetic: a, b, c, ... or A, B, C, ...
- Roman: i, ii, iii, iv, ... or I, II, III, IV,...

No mention at all about what you do for any other language. If you use OpenOffice, they actually do support other languages, and they even save out those other numbering formats into the ODF style:num-format attribute. The problem though is that behavior isn't defined in the spec, so how does anyone else that wants to read that document figure out what OpenOffice's extension means? Maybe I'm just missing something, as the ODF spec is really vague in a lot of areas, but I looked around for a while and couldn't find anything.

Even if you don't pay attention to the things that are just flat-out missing from the format, the documentation for the things it does support is pretty minimal. In the latest Ecma draft, we have about 200 pages discussing the syntax of formulas for spreadsheets; ODF has a few lines. That gives me the impression that no one that does accounting or works on Wall Street was involved in the standard because I can't really imagine them allowing it to go through without specifying how formulas should be represented. It's no wonder the few applications referenced as being "full implementations" of ODF aren't even capable of full interoperability (link).

Alex then replied with the following:

OpenDocument is well known to support variety of languages, and the Japanese ISO member pointed out a couple of problems with the spec. (mostly to do with international URIs). I think they would have noticed if numbering was a problem. The guys in the middle-east were looking at it too.

You're absolutely right about formulas; OpenDocument does not specify a syntax, and that is something the TC is working on. There is a wider problem here, though: formula syntax is something users know directly. Should OpenDocument do something new, or just what Lotus 1-2-3/Excel did/do? OXML has the luxury of only caring about compatibility with Office file formats; OpenDocument is designed to be widely compatible with all.

I may have jumped the gun when I stated that there was *zero* documentation, but I'm curious to know where in the ODF spec these things are specified. When I looked at the numbering sections (4.3, 12.2.2, 14.10.2) it was pretty light, and only called out those three styles I mentioned above. In section 12.2.2 there is reference to the approach used in XSLT for the format attribute, but it just says the attribute is done in the same way, not the same actual formats. The spec then states though that it only supports a specific set, and that it does not support all the different types the XSLT approach uses. The spec says that the number styles supported are ("1", "a", "A", "i", and "I"). Let's assume though that the spec was just worded improperly and it does in fact use the XSLT format approach to the full extent. Then why does OpenOffice output Japanese numbering format like this:

The XSLT spec says that you only put the first character of the list in the format attribute (or at least that's how I interpret it). I didn't see any mention of the approach of putting the first three characters followed by an ellipsis.

That was using Kanji numbering. The XSLT spec actually does call out directly how to do Katakana numbering, and OpenOffice actually doesn't do that properly either (the XSLT spec says it should be format="ア"). Instead, OpenOffice does this:

Now, for those familiar with Japanese numbers (and actually a whole host of other number styles) you know that it isn't always possible to represent a numbering style with just a single character. There are a couple different Kanji numbering styles that start with the same character (the difference is what you do once you get to 10). I assume that's why OpenOffice is going the route that it is.

Where is this approach documented though? Maybe I'm just misreading things here, and there is another portion of the ODF spec or the XSLT spec that allows for that approach? Or does this mean that if you are writing a Japanese document and use numbers with OpenOffice you aren't creating a valid ODF file? This newsgroup post implies that OpenOffice isn't yet fully supporting ODF so maybe that's the case? I suppose the response could just be that the format is extensible and you can place anything you want in that attribute, but how does that lead to interoperability? There's nothing to tell other applications how they should interpret that, as far as I can tell (again, I could be missing something obvious).
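To illustrate the interoperability concern, here is a small Python sketch of a reader that understands only the five number styles listed above ("1", "a", "A", "i", "I"). The function names are hypothetical, and the behavior after "z" (continuing with "aa", "ab", ...) is an assumption borrowed from the XSLT-style sequences rather than something taken from the ODF text.

```python
def to_alpha(n):
    """1 -> a, 26 -> z, 27 -> aa (assumed XSLT-style continuation)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("a") + rem) + letters
    return letters


def to_roman(n):
    """1 -> i, 4 -> iv, 1994 -> mcmxciv."""
    numerals = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
                (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
                (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, glyph in numerals:
        count, n = divmod(n, value)
        out += glyph * count
    return out


# The five style tokens called out in the spec text discussed above.
FORMATTERS = {
    "1": str,
    "a": to_alpha,
    "A": lambda n: to_alpha(n).upper(),
    "i": to_roman,
    "I": lambda n: to_roman(n).upper(),
}


def format_number(n, num_format):
    """Format n for a num-format token; fail loudly on anything unlisted."""
    if n < 1:
        raise ValueError("list numbers start at 1")
    if num_format not in FORMATTERS:
        raise ValueError(f"num-format {num_format!r} is not one of the listed styles")
    return FORMATTERS[num_format](n)
```

With this reader, `format_number(3, "i")` works, but any token outside the five, such as the Katakana "ア", can only be rejected or guessed at. That is the gap being described: without documentation, each consumer has to invent its own interpretation.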

Almost every site I visit to find more information focuses almost completely on the marketing or political side of ODF. There are discussions around conformance, logo compliance, getting governments to support it, etc. etc. etc. I'm having a really hard time finding any good blogs or sites that discuss how to actually use it. I actually came across the oasis public mailing list archives that had some useful content, but I wasn't able to find anything about this issue.

-Brian

28 Comments
Filed Under: Office 2007

The 2007 Office system... already over 500,000 served

posted Thursday, May 25, 2006 4:40 PM by BrianJones

I saw this article up on betanews this afternoon saying that just 24 hours after going live, there were already 200,000 downloads of Beta 2. I thought that was pretty impressive and asked a few folks about it. It turns out that we're actually now over 500,000 and the curve is actually ramping up, and not flattening. Who knows where we'll be by next week. This is awesome!

The first draft of the Open XML spec is published; over a half million people have already downloaded the product that uses Open XML by default; and a brand new developer community is forming. This is your chance to be one of the first to work with what will most likely be one of the most widespread file formats in history. You can be one of the first! Let's get these discussions going over at openxmldeveloper.org. There are about 300 or so members right now, and that was before we even had the beta out. Share your solution ideas, and find out what other people are doing. I'm looking forward to the discussions... most of my friends could care less about file formats, but thankfully I have you guys. :-)

-Brian

2 Comments
Filed Under: Office 2007

Easy way to send us your feedback

posted Thursday, May 25, 2006 2:09 PM by BrianJones

There is now a really easy way to provide feedback as you are using Office 2007. Directly from Joe Friend's blog:

Sending Feedback

We have a great program called "Send a Smile" that provides you the opportunity to tell us what you think about the product. Don't worry you can "Send a Frown" too! Just download the SaS application and install it, then as you are using the product and experience something you love/hate you can simply click on the smile or frown in the notification area at the bottom right of your screen:

Next you'll see a dialog that includes a screenshot of the current application in use and a text entry box to include your comment. You can choose not to send the screenshot if you so wish or you can update it to make sure it shows the issue you're writing about. The more details you put in your comment the better. Click on the image below to see what this dialog looks like.

What happens to this information once you submit it?

The information (your comment and screenshot) is forwarded on to the Office team. We evaluate the comment/screenshot and make sure it gets routed to the correct team. Then we evaluate the information in order to determine if there are changes we need to make to the product. FWIW, we expect to get a lot of feedback, so we can't respond directly to most, but we will be reading it as fast as we can.

So, download Office, load SaS and tell us what you're thinking!

-Brian

2 Comments
Filed Under: Office 2007

Beta downloads and One year anniversary

posted Wednesday, May 24, 2006 10:26 PM by BrianJones

Hard to believe it's only been a year since I first started blogging... so much has happened. It's really great that we went out so early with the news of the file format change; we started talking about it an entire year before we shipped a public beta. I think that helped to get a lot of the questions answered ahead of time, which is really important.

Speaking of the Beta, you may have noticed that the servers are really bogged down right now. A number of folks have had trouble getting the product keys. We are definitely aware of the problem, and there are a lot of people out here working to up the capacity. Sorry for the inconvenience.

-Brian

0 Comments
Filed Under: Office 2007

4000 pages of documentation

posted Wednesday, May 24, 2006 4:01 AM by BrianJones

There has been a great overall reaction to the news last week of Ecma's first public draft for the Office Open XML formats. One thing that is now absolutely clear to everyone is that we are talking about an extremely rich and powerful set of file formats.

I think many folks didn't realize the amount of work we've had to take on, which explains why some had the false assumption that we could just use ODF. We were pretty clear in our response that it just wouldn't work for our customers because at the end of the day an open format is useless if the majority of our customers won't use it. That's why we had to make our formats fully support all the existing Microsoft Office files out there. If the formats didn't support all those features, then the only people who would use them are those that fundamentally want an open format, and everyone else would have just stuck with the old binaries. We absolutely did not want that to happen; we wanted everyone using an open format. We've invested a ton of resources into XML file formats because we believe it's a good thing, and we need to make sure that our customers will be willing to use them.

Let me be clear on a couple key points:

Rich format - Yes the format is extremely rich and deep and that's because it represents a very powerful set of applications that have evolved over many years and many documents. It would have been completely unacceptable for us to create a format that didn't fully represent the existing base of Microsoft Office functionality. If we had created some kind of subset format, many in the industry would have complained for very legitimate reasons. People would have complained that we were destroying fidelity with the key features they used, that we were hiding functionality, not enabling everyone to exploit the rich features, not encouraging the move to XML, etc. Bottom line – millions of organizations would have had a legitimate problem.
Extremely detailed documentation - It's funny but I've actually seen people complaining that there is too much documentation. The documentation is essential, even if there are parts that are not used by everyone. I personally think we have to provide documentation on every aspect of our format, otherwise how do you know what something means? This is a lot of work, and I believe it's absolutely necessary. I can't imagine there being a benefit to anyone from not documenting something.
Full implementation - I don't think it should come as a surprise that with the rich set of features in Office, it's going to be a lot of work to build an application that can support all of that functionality. In the past, people had said that the reason nobody could build an application that matched Office was that the formats were locked up. Well, the format information was available, but not for all the many purposes that we are enabling now. Now all those people should be happy because the format information is complete enough to enable a full understanding by everyone. It's up to those other applications though to decide what level of support they want to build. While I think interoperability is possible, the struggles that the applications supporting ODF are having show that it's really a lot of work even for a format that isn't as deep. This is often to be expected though because the different applications have different sets of features as well as different implementations of the same features. That is how things work.
Partial implementation - Now, if you don't care about fully matching Office features, then anyone can choose to just support a subset of the format. You can implement as much or as little of the format as you want. You can build an application that does nothing more than add comments to a WordprocessingML document, or that automatically generates a rich SpreadsheetML file based on business data. It's up to you. The information is all there to use in the way that best benefits your application.
Room for innovation - Now that all the features we've stored in our formats are fully open and documented, people are free to build with them. In addition to the fact that you can implement as much or as little of the format as you want, you are also free to add more to the format. The formats are completely extensible. You can add your own extensions to the format, or you can even join Ecma and propose that those extensions get added to the official Ecma standard. The strong support for custom defined schema in Office gives you a lot more power than what a document format on its own would give you, through integration of your own parts.
Microsoft does not own the standard - We no longer own these formats, Ecma does. I know there is still concern out there that these formats could change out from under you, but that's not something that Microsoft can do. Ecma fully controls it, and once it goes through ISO, it will be even more solid and locked down.
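As a rough picture of the "partial implementation" case, generating a spreadsheet from business data, here is a short Python sketch. It emits a simplified row/cell shorthand (`table`, `row`, `c`, `v` and `f` elements) rather than the real SpreadsheetML schema, and the function name and the column layout (Product ID, Price, Quantity, Total) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def orders_to_xml(orders):
    """Build a shorthand table from (product_id, price, quantity) tuples."""
    table = ET.Element("table")
    header = ET.SubElement(table, "row")
    for title in ("Product ID", "Price", "Quantity", "Total"):
        ET.SubElement(ET.SubElement(header, "c"), "v").text = title

    # Data starts on row 2 because row 1 holds the column titles.
    for row_no, (product_id, price, quantity) in enumerate(orders, start=2):
        row = ET.SubElement(table, "row")
        for value in (product_id, price, quantity):
            ET.SubElement(ET.SubElement(row, "c"), "v").text = str(value)
        total = ET.SubElement(row, "c")
        # Store the formula together with its precomputed value.
        ET.SubElement(total, "f").text = f"=B{row_no}*C{row_no}"
        ET.SubElement(total, "v").text = f"{price * quantity:.2f}"
    return ET.tostring(table, encoding="unicode")
```

A tool like this touches only a handful of elements and ignores everything else in the format, which is the point of the partial-implementation option. A real generator would follow the published schema and packaging rules instead of this shorthand.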

I'd also like to reuse some information that I left as a comment in my post last week. Some people were a bit confused on how you could create a standard that was so rich and had all the backward compatibility with the existing base of Microsoft Office documents. It was even suggested that almost as a way of leveling the playing field we should choose just a subset of features that we think everyone can build applications for. This would be a great move for our competitors but a horrible move for our customers. Adam provided a lot of feedback and I really appreciated that he took the time to write all that up. Patrick and Biff had some really great replies that tried to explain why backward compatibility was so important. Here is the reply I left for Adam that I hope helps really clear up his questions around why we went the standardization route in the first place:

Hey Adam, thanks for taking the time to get all your thoughts down. It definitely has helped me understand where you are coming from.

It sounds like you understand that from our point of view, in order to use an XML format as the *default* format for Office it needs to be 100% compatible, right? I think your point is more that we should also have an optional format that is more basic and doesn't necessarily have 100% of the features covered. That smaller more basic format would then be the one that should be standardized. I think that's what you are saying.

Based on your description, the format you desire sounds a lot like HTML. HTML is a great format for basic interchange. It doesn't support everything that is present in an Office document, but as you said, that isn't always desirable. We've supported HTML for quite a while, although we took the approach of trying to have our cake and eat it too when we attempted to make our HTML output support 100% of our features. The result was an HTML format that had a ton of extra stuff in it that many of the people who just wanted HTML didn't really care about (and it just got in the way).

Our primary goal this release with the formats was not to try and re-implement HTML, but instead to move everyone over to using XML for all of their documents. Let's talk about the motivations for what we are doing with Open XML since that was the main point of your question:

The reason we've spent the past 8 or so years moving our formats toward a default XML format is that we wanted to improve the value and significance of Office documents. We wanted Office documents to play an important role in business processes where they couldn't before. We wanted to make it easier for developers to build solutions that produce and consume Office documents. There are other advantages too, but the main thing is that Office documents are much more valuable in just about every way when they are open and accessible.

The reason we fully document them is exactly the same. We need developers to understand how to program against them. Without the full documentation, we don't achieve any of the goals I stated above. The only benefit would be that other Microsoft products could potentially interact with the documents better (like SQL or SharePoint), but that doesn't give us the broad exposure we want. That would be selling ourselves short. We want as many solutions/platforms/developers/products as possible to be able to work with our files.

The reason we moved to the "Covenant not to sue" was that a number of people out there were concerned that our royalty free license approach wasn't compatible with open source licenses. Again, since the whole reason for opening the files was to broaden the scenarios and solutions where Office documents could play a role, we moved to the CNS so that we could integrate with that many more systems. Initially we'd thought the royalty free license just about covered it, but there was enough public concern out there that we decided we needed to make it even more basic and straightforward. We committed to not enforce any of our IP in the formats against anyone, as long as they didn't try to enforce IP against us in the same area. No license needed, no attribution, we just made a legal commitment.

The reason we've taken the formats to Ecma for standardization is that it appeared that a number of potential solution builders were concerned that if we owned the formats and had full control, we could change them on a whim and break their solutions. We also had significant requests from governments who wanted to make sure that the formats were standardized and no longer owned by Microsoft. Long term archive-ability was really important and they wanted to know that even if Microsoft went away, there would still be access to the formats. We were already planning on fully documenting them, but the Ecma standardization process gave us the advantage of going through a well established formal process for ensuring that the formats are fully interoperable and fully documented. It's drawn a lot more attention to the documentation as well so I'm sure we'll get much better input, even from folks who aren't participating directly in the process.

I hope that helps to clear it up a bit. It really is just as simple as that. Any application is free to implement as little or as much of the format as they wish. If you really want every application operating on a more limited set of features, that isn't as much of a format thing as an application thing. You would need to get every application to agree that it will not add any new features or functionality, and will disable any existing functionality that the other applications don't have. That wasn't our goal. Our goal was to open up all the existing documents out there, and then anyone who wants to build solutions around those formats is free to do so. In addition, anyone is free to innovate on top of the formats, as I believe there is still a lot of innovation to come. The formats are completely extensible, so if someone wants to use the formats (or parts of the formats) as a base and build on top of that, they can do so as well. They can even join Ecma if they want and propose to add those new extensions to the next version of the standard.

-Brian

24 Comments
Filed Under: Office 2007
