Orcmid's Lair

2007-02-11

OOX-ODF: The Danger of Finding Only What You're Looking For

One problem that scientific investigators must always be cautious about is finding what we are looking for because that is all we are able to see.  This has tripped up many an “objective” person.  I confess to getting that particular pie in my face often enough to be very careful.  I’ll not be surprised when it happens again. I do promise to correct whatever is necessary.

This human condition is amply demonstrated when we are making abstract attributions and speculations about the actions of others.  It is easy to see all conduct as evidence that justifies the attitude we already bring to the party.  Our spectacles are already smeared with the stain of our own prejudgments.

It is easy to see how this arises in the way everything Microsoft does or doesn’t do is constantly explained by the corporation’s malevolent intentions.  This applies to the simplest acts and to some silly and sometimes completely stupid moves.  But the indignant proclamations of further evidence of evil intent continue.  Microsoft was not even “convicted” of being a monopoly (being one is not illegal), yet that is often said in justifying every suspicion.

Similarly, to regard the behavior of IBM as a corporation, and of a few visible IBM employees, in opposing the promulgation of Office Open XML (OOX) as a considered, calculated business maneuver is, to my mind, giving IBM far too much credit.  There would be considerable risk of embarrassment (and worse) were such machinations substantiated by non-repudiable facts.  There are too many other explanations that fit the observed behavior, especially given all of the years the IBM internal echo chamber has had to establish the received wisdom that Microsoft did them wrong (out-IBM-ing IBM, more or less, for those of us with long memories).  I’m willing to believe that it is really personal.  I am not prepared to go further than that.  There are lots of agendas here.  I doubt that one can isolate and confirm a single one to fit all of the conduct that we see.  The same goes for Microsoft when one employee or another lashes out in some particular way.

The so-called evidence for contradiction in OOX that has been compiled at Groklaw is another example of the lengths we can go to when we are too happy with our findings and are careless with over-reaching interpretations of facts that are open for anyone to inspect more carefully.  There may be a pony in there, but the readiness to accept blatant nonsense tends to smear the pony with manure.  I’ll pick the first example that came to my attention, because it is so clear-cut.  (It is not my purpose to engage in an extensive analysis of complaints about OOX; I’m looking at where attitude leads to credulity.)  After that I’ll turn to a more recent sequence of extrapolations that are even sillier, especially considering how easy they are to check.


You Say Bright Green, I Say Chartreuse

On January 28, Sam Hiser blogged about a problem with the way colors are handled in OOX.  The summary statement in his blog feed was pretty direct:

Yoon Kit Hasan Saidin over at the Open Malaysia blog have a chrystal clear view of one of the ways Microsoft Office Open XML contradicts existing global technology standards. It ignores the WC3 SVG standard.

On the blog page there’s a little table with two different sets of color mappings for a single set of color names (dark blue, dark cyan, etc.).  I first thought the difference might have to do with the choice of standard web color codes, but the OOX choices don’t completely line up with those either.

After looking at Yoon Kit Hasan Saidin’s article, I concluded two things (in a long-winded comment), and Yoon Kit has replied.  What seems to have gotten lost in the discussion, and I am certainly at fault for not writing more crisply, is this:

  • OOX does not implement SVG.  We all knew that.  However, the color names used in the DrawingML portion of the specification line up with and match perfectly with the colors defined in the SVG specification.  There is no disparity in the supported DrawingML colors with those of SVG.  None.  Niente.  Full stop.  Period.
        
  • The only difference is in how the colors are given names (all in English, all in coded attribute values).  OOX DrawingML abbreviates the prefix “light” to “lt” and “dark” to “dk”.   These are for codes conveyed in OOX nomenclature as values of particular attributes defined under the DrawingML schema and namespaces.  They are rigorously defined and their red-green-blue levels are given and are in complete accord with SVG mappings.  Although they are completely harmonized by their agreement on red-green-blue levels, and easy correspondence of names, no one should treat the attributes as directly substitutable into SVG.  The mapping is trivial, but there is no pretense that the attributes are the same.  That’s not how XML works and that’s not ever suggested in the OOX specification. 
       
  • Keep in mind that this has nothing to do with how the colors are presented to users or named in displays which may well be internationalized and also given different expressions for what appears, in the file, as one of the allowed XML attribute values.    
      
  • There is almost no likelihood of confusion by implementors who are working from the OOX specification and this is not material that end users are likely to see in literal form.  They might, but that is about presentation and applications can be arranged to do what is most appropriate to serve their user community well.
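
To make the "trivial mapping" concrete, here is a minimal sketch of turning a DrawingML color value into the corresponding SVG keyword. It assumes that DrawingML values are camel-cased (for example "dkBlue") and that SVG keywords are all lower case (for example "darkblue"). The function name and the prefix table are hypothetical, and only the two abbreviations mentioned above are handled; any other abbreviation in the schema would need its own entry.

```python
# Hypothetical helper, not taken from either specification.
# Assumption: DrawingML color values look like "dkBlue" / "ltGray",
# and the matching SVG keywords look like "darkblue" / "lightgray".
PREFIXES = {"dk": "dark", "lt": "light"}


def drawingml_to_svg_name(value: str) -> str:
    """Expand an abbreviated DrawingML prefix and lower-case the result."""
    for short, full in PREFIXES.items():
        rest = value[len(short):]
        # Treat the leading letters as an abbreviation only when a
        # capitalized word follows them.
        if value.startswith(short) and rest[:1].isupper():
            return (full + rest).lower()
    return value.lower()


print(drawingml_to_svg_name("dkBlue"))  # darkblue
print(drawingml_to_svg_name("ltGray"))  # lightgray
print(drawingml_to_svg_name("red"))     # red
```

The point of the sketch is only that the correspondence is mechanical. It says nothing about substituting one attribute for the other in a file.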

Well, where did those widely different color mappings in the blog-post table come from?  They arose by taking the codes for oranges and lining them up with codes for apples (sorry).  The codes shown for OOX mappings have nothing to do with the SVG names nor the use of colors in DrawingML.  The OOX ones being compared with the identified SVG colors are the OOX colors for highlighting over text.  OOX has a limited set of highlighting colors for use in WordprocessingML and only in WordprocessingML.  They are also named by fixed, rigidly-defined English-language text strings in attribute values.  The red-green-blue levels that those attribute values correspond to are also rigorously and clearly defined.  Some happen to be different mappings than the ones used for similarly-spelled (but different) attribute values in DrawingML.

The highlight-color attributes are different attributes with a different use, and there is no apparent intention to correlate them with the same names when used for the SVG and DrawingML colors.  These never show up in DrawingML.  As with the DrawingML colors, implementing them correctly is trivial when following the specification.  And, in fact, not even Microsoft Office Word 2007 uses all of the same names in its English (US) interface for selection of highlight colors.
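
One way to picture the apples-and-oranges problem is as two separate lookup tables that happen to share similar keys. The sketch below is an assumption-laden illustration: the table names and the resolve function are hypothetical, and the red-green-blue values are quoted from memory of the SVG color keyword table and the WordprocessingML highlight-color table, so check both specifications before relying on them.

```python
# Illustrative values only; verify against the specifications.
SVG_COLORS = {
    "darkblue": (0, 0, 139),
    "darkcyan": (0, 139, 139),
}

# WordprocessingML highlight colors: a different attribute with a
# different, much smaller vocabulary.
WML_HIGHLIGHT_COLORS = {
    "darkBlue": (0, 0, 128),
    "darkCyan": (0, 128, 128),
}

TABLES = {"svg": SVG_COLORS, "highlight": WML_HIGHLIGHT_COLORS}


def resolve(vocabulary: str, name: str) -> tuple:
    """Look a color name up in the table that owns it, and only there."""
    return TABLES[vocabulary][name]


print(resolve("svg", "darkblue"))        # (0, 0, 139)
print(resolve("highlight", "darkBlue"))  # (0, 0, 128)
```

Comparing the output of the two calls is the comparison made in the blog-post table. The values differ because the names belong to different vocabularies, not because either table contradicts the other.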

The attribute values are for technical coding of colors.  They are not about what users see or what the colors might be called by different users.  OOX doesn’t prescribe any of that.  You might have done this differently, and I might have also, given a blank sheet of paper to start with.  It doesn’t matter.

This is not rocket science.  Yet the declaration of Yoon Kit, parroted by Sam, is “MSOOXML contradicts W3C SVG Colour definitions” and the comparison in the table between SVG colors and the highlight colors in OOX WordprocessingML is simply bogus.

What I say: No harm, no foul.  Interesting way to learn more about how OOX is specified.

The Mysterious Document Updates

[updated 2007-02-12T07:15Z to smooth out some bumps and account for a third flavor, the single 5-parts-in-one PDF that was apparently submitted to ISO.]

On February 8, Rob Weir posted about “Here Today, Gone Tomorrow.”  Here’s the gist of it:

The Ecma OOXML web site has been updated. The version of the OOXML specification which was submitted to JTC1 is not longer there. Instead we have a new version, generated on February 1st. I have no idea if the content of the new version differs in any substantial way from the older version, but it is clear that the pagination is different. So page number citations, as referenced in this blog and other places (such as the Groklaw analysis) are now incorrect.

(Why don't I cite using section numbers? Good question. This is because the version of OOXML submitted to JTC1 reused section numbers, so a reference to "section 3.4.2" could be ambiguous.)

Of course there are hilariously speculative comments to go with the full post.  And Microsoft’s Brian Jones is now archly pestered by some messenger of joy leaving innuendos on his blog and challenging him to explain what happened.

Let’s take the section number issue first.  It is true that the ECMA specification is in 5 parts (with some additional materials, such as the schema files).  This is typical of ISO documents, and was true of the October Technical Committee draft that was submitted and subsequently approved by ECMA in December 2006.  How does one reference section numbers from outside the document set when each part starts over from section 1?  What I do is say what part and, when I am making lots of references, I simply put the part number in front of the section.  E.g., section 4-2.18.46 is the section (in Part 4) where the WordprocessingML highlight colors are defined.  I didn’t think this was a new problem in doing careful analysis and commentary on ISO specifications.  But I haven’t had to deal with any of them in several years.
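
The part-prefixed convention is easy to handle mechanically. Here is a minimal sketch that splits a reference such as "4-2.18.46" into its part number and section path. The SectionRef type and parse_ref function are hypothetical names introduced for illustration only.

```python
from typing import NamedTuple, Tuple


class SectionRef(NamedTuple):
    part: int                 # which of the parts, e.g. 4
    section: Tuple[int, ...]  # section path within that part, e.g. (2, 18, 46)


def parse_ref(text: str) -> SectionRef:
    """Parse a part-qualified reference written as '<part>-<section>'."""
    part_text, separator, section_text = text.partition("-")
    if not separator:
        raise ValueError(f"no part prefix in {text!r}")
    section = tuple(int(number) for number in section_text.split("."))
    return SectionRef(int(part_text), section)


ref = parse_ref("4-2.18.46")
print(ref.part, ref.section)  # 4 (2, 18, 46)
```

Because the part number travels with the section number, "2.18.46" in one part can no longer be confused with the same number in another.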

I have a different beef with the section numbering in the OOX specification.  The tables of contents don’t go deep enough.  The PDF produced from the OOX .docx of the specification has hyperlinks in the tables of contents and in cross-references, but the TOCs are too shallow.  It is a bloody pain to get to 4-2.18.46 by scrolling from the top of 2.18.

Now, about the repagination.  Rob is a smart guy, and he has gone through the big Part 4 with a fine-toothed comb.  I think he could have figured this out easily without leaving such a great opening for the speculation of mischief.  It strikes me as the professional, responsible thing to do.

I’m going out on a limb here, because I have no idea what documents were physically delivered to ISO JTC1 as the ECMA submission.  What I have in my possession are the files for the December 2006 ECMA-376 as they were on February 9, 2007 when I downloaded them.   Of these, the Part 4 DOCX (in a Zip file with other material) version is corrupted (a problem I sometimes have with some documents from some sites), so I couldn’t use Word 2007 to compare it with any earlier edition.  I also have the final TC45 Drafts that were created in October for comparison.  Here is the result of my explorations.

[update 2007-02-12T01:50Z: I got it!

First, there is no question in my mind that the current downloads constitute the official ECMA-376 documents, with all of the Final Draft title pages corrected and an official, overall cover (two sheets) attached to the front of Part 1.  I am satisfied that this is precisely the same content that was approved by ECMA and also used to form the ISO submission.

  • Thanks to a cache page provided by Stephen G. Johnson, I see that there was apparently a special ISO-submission document prepared from the 5-part Final Draft in the form that was balloted and approved at ECMA (and preserved in the current download, with the kind of modifications I identify below).
       
  • That’s unfortunate.  It seems that the document that was reviewed by Rob Weir and others as the submission to ISO is a single 6,039-page, 46.9MB PDF.  It has the five parts (without the other useful attachments, such as versions of the schemas) simply pasted together consecutively, but still having the original tables of contents and the separate on-the-page numberings that apply to the individual parts.
        
  • The only differences that I have observed are the addition of the ECMA cover pages (also now on the front of the current Part 1 download) and the omission of the title page for Part 1.  I found no other changes.
      
  • As a document to review, the single-file document is very unsatisfactory.  First, there is no overall table of contents.  Each part still has its own table of contents and page numbers, but you have to find the parts on your own.  In addition, none of the hyperlinking that works in the .docx versions and the corresponding PDFs is preserved in the single-PDF version.

Since I had been downloading the drafts as they were announced as available from ECMA TC45, I was not particularly concerned about obtaining replacements for my copies of the final drafts.  I lucked out.  When I submitted comments to ECMA, I was able to use those far-more-manageable drafts from October and also earlier.  Stephen Johnson has also provided a similar collation of the current downloads.  I really don’t recommend using that.  The five separate parts are superior in every respect for review, analysis, and discussion of comments.

I have no idea what procedural requirements led to the submission of a single PDF file.  I hope that people were advised of the availability of the separated parts in easier-to-review form.]

Differences in the Part 4, Markup Language Reference PDF.  This is the big honker that you dive into when you want to see the precise details of all of the attributes and formats of OOX.  Here’s where those who are looking for it find buried treasure and smoking guns.  So what’s the difference?  Well, the PDF that was created on February 1 has one more physical page in it than the PDF that was created on October 6, 2006.  That is the blank page after the front title page.  The page is counted in the page numbering, it just wasn’t physically present in the October 6 version.   I can’t do a line by line or word for word comparison, but I can tell you that the numberings on the pages are identical and the table of contents (and the pages referenced in the table of contents) are identical.  When I obtain the DOCX version of the latest downloads, I will have Word 2007 compare them and show me the changes.  Meanwhile, if page references are to the numbers on the pages (and not the sequential page-count positions displayed by PDF), there seems to be no problem.  And even if the PDF page-count numbers were used, the new document’s PDF page-count numbers for the main section are simply greater by 1 for the same numbered page.  [updated 2007-02-11T11:33Z in a poor attempt to distinguish between the numberings PDF shows by counting the pages that are there, in sequence, and the numbers that are printed on the pages.]
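
The distinction between the number printed on a page and the position a PDF viewer reports can be sketched in a few lines. The function name is hypothetical, and the front-matter count of 20 is a made-up placeholder rather than a figure taken from either PDF; the only thing carried over from the comparison above is that the newer file has one more physical page ahead of the main text.

```python
def pdf_position(printed_page: int, pages_before_page_1: int) -> int:
    """1-based position in the file of the main-text page numbered printed_page."""
    return pages_before_page_1 + printed_page


# Placeholder count of physical pages ahead of page 1 (an assumption).
OCTOBER_FRONT_PAGES = 20
# The February file contains one extra physical page: the blank page
# after the front title page.
FEBRUARY_FRONT_PAGES = OCTOBER_FRONT_PAGES + 1

for printed in (1, 100, 2500):
    old = pdf_position(printed, OCTOBER_FRONT_PAGES)
    new = pdf_position(printed, FEBRUARY_FRONT_PAGES)
    print(printed, old, new, new - old)  # the last column is always 1
```

A citation by printed page number is the same in both files. Only citations by viewer position move, and they all move by the same amount.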

Differences in Part 1, Fundamentals.   I know that older versions of Word can compare two documents and synthesize what are the additions and deletions between one and the other.  Versions since at least Office 2003 also provide sidebar annotations that explain changes, making it easy to scroll through all of them.  There are other ways to navigate from difference to difference as well.  I’d seen a post that suggested that Office 2007 does this even better.  So what a great test: doing some document forensics on the OOX specification itself using the current closest implementation. 

I started with Part 1 for no good reason other than it was first.  Using the .docx files now, the final committee draft is dated October 9 and it has 173 pages in the file.  These are pages i (un-numbered cover) through viii (end of the Introduction), followed by 166 pages of the main document text. 

The latest download (dated January 25, 2007 in the file itself) has 178 pages in the file.  These are page i (un-numbered cover) through xii (end of Introduction), followed by 166 pages of the main document content.  The tables of contents are the same, and the numbered pages of the main text are the same.  I mean the same.  There is no material difference in content.  The pagination on the pages themselves has not changed at all.  (These are all formatted for 8.5” by 11” paper, as are the PDFs, by the way.) 

The difference in the front matter is the addition of two new cover sheets (with blank backs) and modification of the original Part 1 title page to simply introduce the part and remove text that applied only in the TC45 draft.  There are other differences.  They are immaterial.  They seem to be entirely involved with styles in the tables of contents, fields that produce titles over the pages, and formatting of bulleted lists.  The result seems to be indistinguishable, but Word 2007 says there was a change.  I saw nothing different outside of the 4 new pages of front matter and the edits to the original title page.

This is not an exhaustive comparison.  Anyone who thinks they might find something significant is welcome to do more. 
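
For anyone who wants to do more without Word 2007, a rough text-only comparison of two .docx files can be sketched with the Python standard library. This is a sketch under stated assumptions, not the method used above: it assumes the main document part is stored at word/document.xml inside the Zip package (the usual location in files written by Word, though strictly the package relationships decide that), and the function names are hypothetical. It looks only at paragraph text, so differences in styles, fields, and list formatting of the kind Word reported will not appear.

```python
import difflib
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the plain text of each paragraph in a .docx file."""
    with zipfile.ZipFile(path) as package:
        # Assumption: the main part lives at word/document.xml.
        root = ET.fromstring(package.read("word/document.xml"))
    # Concatenate the text runs (w:t) found inside each paragraph (w:p).
    return [
        "".join(run.text or "" for run in paragraph.iter(W + "t"))
        for paragraph in root.iter(W + "p")
    ]


def diff_docx(old_path, new_path):
    """Unified diff of the paragraph text of two .docx files."""
    return list(
        difflib.unified_diff(
            docx_paragraphs(old_path),
            docx_paragraphs(new_path),
            "old",
            "new",
            lineterm="",
        )
    )
```

Calling diff_docx on the October and January files of one part would return an empty list if the paragraph text is identical, and otherwise the changed paragraphs prefixed with "-" and "+".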

Maybe ECMA could use stronger document controls and account for re-issues better.  But there appears to be no difference and the material is on the same-numbered pages that it has been since October.  I suspect ECMA might feel they are insulted, though.  And rightly so.

My provisional assessment: No harm.  No foul.  Waste of time.  On further reflection: some lessons about how to cross-reference documents and also on how to provide some document engineering controls so people know what’s what.

