Orcmid's Lair

2007-02-07

Latest OOX-ODF FUD-Spat: States Prepare to Ban Zip and PDF Files

It seems that a new early-Spring pastime is watching State legislatures come up with bills that are designed to secure and preserve the digital documents of civil administration in open formats.   These initiatives (among thousands of bills that are put in the hopper in each State’s legislative session every year) come and go, except for the Commonwealth of Massachusetts where the serious work of mandating such a transition made it into law and is (quietly?) underway.  Other State governments might want to wait a while until Massachusetts gains some serious experience over the course of this year and probably into 2008 as well.  That would be the prudent, taxpayer-sensitive approach.

News of these new governmental undertakings is usually accompanied by enthusiastic rejoicing.  It is proclaimed how the Microsoft-originated Office Open XML format for document interchange (OOX here) can never qualify, har, har, hardy har.  It is conveniently overlooked that various, if not all, ODF implementations are disqualified using precisely the same rationale.


In this week's exciting developments, Bob Sutor points to Elizabeth Montalbano’s InfoWorld article: Texas, Minnesota eye move to ODF.  The body of the article does admit that ODF (the OASIS/ISO Open Document Format) is not mentioned in the draft legislation, nor is any specification or standard cited in the drafts.  One sponsoring State Senator in Minnesota confesses that ODF is indeed intended to be the only solution.

Montalbano’s coverage is well-balanced.  I recommend it as a source on these early 2007 legislative efforts.  There are links to the specific legislative proposals, as currently written, and that’s a rare demonstration of good journalism when digging into these controversies.

  Minnesota H.F. No. 176 (85th Legislative Session) is succinct, requiring an “open, XML-based file format, as specified by the chief information office of the state.”  The explanation of what qualifies for such a format is given in four points:

  1. interoperable among diverse internal and external platforms and applications;
  2. fully published and available royalty-free;
  3. implemented by multiple vendors; and
  4. controlled by an open industry organization with a well-defined inclusive process for evolution of the standard.

This document is published on the web in XHTML (1.0 transitional) and Adobe PDF format.

The Texas offering is published on the web in HTML 4.01, Adobe PDF, and (ahem), Microsoft Word (apparently as a Document Template, not an ordinary document) format.  It is recorded as Texas Bill 80(R) SB 446.  This bill also calls for (pay attention now), an “open, Extensible Markup Language based file format, specified by the department, that is:

  1. interoperable among diverse internal and external platforms and applications;
  2. published without restrictions or royalties;
  3. fully and independently implemented by multiple software providers on multiple platforms without any intellectual property reservations for necessary technology; and
  4. controlled by an open industry organization with a well-defined inclusive process for evolution of the standard.

I thought you’d like to see exactly what Montalbano politely notices as “similar wording to describe the file format the states intend to support.”  Uncanny, isn’t it?  We can be thankful for open government and the Internet.

At the moment, I don’t think anyone can hurdle the bar written into Texas item (3) once any serious level of State-sponsored scrutiny and accountability is applied.  That item is likely to be rewritten for that reason alone, so I will stick with the simpler common wording from Minnesota.

Before continuing, I want to point out that I am fully in favor of civil authorities adopting standardized open formats, with appropriate profiles for their use among qualified products for use in interchange and preservation of public documents that arise in the course of civil administration.  My testimony for unreserved opening of Microsoft document formats (not merely for governmental use) is a matter of public record. 

There have been some promising legislative proposals for adopting open standards for digital documents.  These two are not among them.  [update 2007-02-08T00:03Z Microsoft’s Brian Jones has posted a wonderful analysis of what Texas could have said it was out to accomplish.  He provides a positive reading in contrast with my nit-picking.  Brian’s positive interpretation of these legislative adventures, along with the positive elements of an earlier Minnesota bill, deserves careful attention.  I think there are goals that could be aligned on and also made actionable in a way that fosters interoperability and encourages competition.]

Let’s talk about the language of criterion (1) and a minor little technical problem with criterion (4). 

First, document formats don’t interoperate.  Sorry, they don’t do that.  Computers and software systems interoperate.  With luck, the document formats are usable in interchange and can be migrated from one platform to another. There, they can ideally be accessed/manipulated in some fashion by different suitably-coordinated applications so as to fulfill some useful mutual purpose of the human participants in the scenario. 

Furthermore, without applications meeting some suitability-for-the-task criteria, nothing about supporting a common format guarantees, of itself, that the applications will preserve whatever it is the task requires in order to be successful.  Round-trip editing of arbitrary office documents, for example, is a challenge to the applications.  The format may be an enabler, but support of a common (standard or not) format does not ensure fulfillment of the requirement.

And finally, none of the standard formats can pass (4) whether or not the ECMA Office Open XML specification ever becomes the ECMA/ISO Office Open XML standard.

This is the hippopotamus under the carpet: ODF and OOX both depend on a non-XML, privately-specified, binary format for their operation.  They both use a Zip archive as a package for all of their XML parts and other content.  The OOX *.docx file and the ODF *.odt file, for example, are Zip files with special filename extensions.  Taken literally, both of these over-specific legislative proposals would prohibit the use of Zip, a proprietary, privately-defined format, and it is not clear how ODF and OOX could be adopted in their Zip-packaged form.  The legislation would also ban Adobe PDF in any of its current instantiations.  This is a wonderful demonstration of the ridiculousness of using legislation to implement a technology agenda.  [update 2007-02-08T03:30Z Thanks to Bill Anderson for requesting clarification of why Zip matters in his comment, below.]
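The Zip dependence is easy to check for oneself.  What follows is a minimal Python sketch, not taken from either specification: it packs a couple of made-up parts into an in-memory archive with the standard zipfile module and confirms that the result is an ordinary Zip file.  The function name, part names, and XML content are hypothetical and only for illustration; a real ODF or OOX package has more parts and stricter rules.

```python
import io
import zipfile

def build_toy_package(parts):
    """Pack a {name: bytes} mapping into an in-memory Zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, payload in parts.items():
            package.writestr(name, payload)
    return buffer.getvalue()

# Hypothetical parts standing in for the XML and binary content of a document.
toy_parts = {
    "content.xml": b"<?xml version='1.0'?><doc>Hello</doc>",
    "images/logo.png": b"\x89PNG placeholder bytes",
}
package_bytes = build_toy_package(toy_parts)

# Whatever filename extension is put on these bytes, they begin with the Zip
# local-file-header signature and the standard library treats them as Zip.
looks_like_zip = zipfile.is_zipfile(io.BytesIO(package_bytes))
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    part_names = package.namelist()

print(package_bytes[:4], looks_like_zip, part_names)
```

Saving package_bytes under a name ending in .odt or .docx would not make it a valid document of either kind, but it would still open in any Zip tool, which is the point.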

Now, we can agree that Zip has all of the qualities of a de facto standard.  But this legislation can’t honor a de facto standard.  It can’t even allow PDF, even though Adobe is now in the process of submitting PDF to AIIM for ultimate AIIM/ISO standardization, because PDF is not based on XML.  Sorry.  There’s just no way to allow a de facto standard, without leaving room for adoption of OOX and even Microsoft Office binary formats.  And of course, naming the enemy in an act of legislation would be an act of legislative self-immolation (and probably be regarded, with or without the immolation, as illegal to boot).

The OASIS Open Document Format specification (version 1.0 and the newly-approved version 1.1) includes the following concession to Zip (section 17.1):

As XML has no native support for binary objects such as images, [OLE] objects, or other media types, and because uncompressed XML files can get very large, OpenDocument uses a package file to store the XML content of a document together with its associated binary data, and to optionally compress the XML content. This package is a standard Zip file, whose structure is discussed below.

Notice the little “standard Zip file” twist.  I love the concessions to OLE objects and DDE (both Microsoft protocols) that are wired into the ODF specification.  We didn’t need to specify spreadsheet formulas, but by golly we’ve got OLE covered.  Also, if you think there are no binary chunks in an ODF document, stop kidding yourself.

Meanwhile, there is no profile about the precise aspects of the Zip format that must or must not be present for ODF.  There is a Zip hack though.   It is accomplished with this language (from version 1.1, section 17.4):

If a MIME type for a document that makes use of packages is existing, then the package should contain a stream called "mimetype". This stream should be first stream of the package's zip file, it shall not be compressed, and it shall not use an 'extra field' in its header (see [ZIP]).

The purpose is to allow packaged files to be identified through 'magic number' mechanisms, such as Unix's file/magic utility. If a ZIP file contains a stream at the beginning of the file that is uncompressed, and has no extra data in the header, then the stream name and the stream content can be found at fixed positions.
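The "fixed positions" claim can be made concrete.  The sketch below is an illustration in Python, not code from the ODF specification.  It assumes the usual Zip local file header layout (30 fixed bytes, with the compression method at offset 8, the stored size at offset 18, and the name and extra-field lengths at offsets 26 and 28), so an uncompressed first stream named "mimetype" with no extra field has its name at offset 30 and its content at offset 38.  The function names and the content.xml payload are hypothetical.

```python
import io
import struct
import zipfile

ODT_MIME = b"application/vnd.oasis.opendocument.text"

def build_package_with_mimetype(mime, other_parts):
    """Write 'mimetype' as the first stream, stored rather than compressed."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("mimetype", mime, compress_type=zipfile.ZIP_STORED)
        for name, payload in other_parts.items():
            package.writestr(name, payload)
    return buffer.getvalue()

def sniff_mimetype(data):
    """Return the mimetype found at the fixed positions, or None if absent."""
    if len(data) < 38 or data[:4] != b"PK\x03\x04":
        return None
    (method,) = struct.unpack_from("<H", data, 8)        # 0 means stored
    (stored_size,) = struct.unpack_from("<I", data, 18)   # size of the stream
    name_len, extra_len = struct.unpack_from("<HH", data, 26)
    if method != 0 or name_len != 8 or extra_len != 0:
        return None
    if data[30:38] != b"mimetype":
        return None
    return data[38:38 + stored_size]

package_bytes = build_package_with_mimetype(
    ODT_MIME, {"content.xml": b"<doc>Hello</doc>"}  # hypothetical content
)
sniffed = sniff_mimetype(package_bytes)
print(sniffed)
```

Nothing here needs a Zip library on the reading side; a magic-number tool only has to compare bytes at known offsets.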

The use of “should” and “shall not” is technical standardese and in this particular place means that no processor of ODF can count on this being there, but a producer of ODF had better provide it.  Of course, there is no means to ensure that “standard Zip file” processes will not break this arrangement.
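Here is one way the arrangement can break, sketched in Python under stated assumptions.  The "generic repack" is a hypothetical stand-in for any Zip tool that rewrites an archive without knowing about the ODF convention: it sorts the entries by name and deflates everything.  The helper names and the content.xml payload are invented for the example.

```python
import io
import zipfile

MIME = b"application/vnd.oasis.opendocument.text"

def fixed_position_match(data):
    """True if 'mimetype' and its content sit at the fixed offsets 30 and 38."""
    return data[30:38] == b"mimetype" and data[38:38 + len(MIME)] == MIME

def write_zip(entries, mimetype_compression):
    """Write (name, payload) pairs in the given order into an in-memory Zip."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, payload in entries:
            kind = mimetype_compression if name == "mimetype" else zipfile.ZIP_DEFLATED
            package.writestr(name, payload, compress_type=kind)
    return buffer.getvalue()

def read_mimetype(data):
    """Read the 'mimetype' stream the ordinary way, through the Zip directory."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        return package.read("mimetype")

entries = [("mimetype", MIME), ("content.xml", b"<doc>Hello</doc>")]

# A careful producer: mimetype first and stored.
careful = write_zip(entries, zipfile.ZIP_STORED)
# A generic repack: same entries, sorted by name, mimetype deflated too.
repacked = write_zip(sorted(entries), zipfile.ZIP_DEFLATED)

print(fixed_position_match(careful), fixed_position_match(repacked))
print(read_mimetype(repacked) == MIME)
```

Both archives are perfectly good Zip files with identical contents, yet only the first can be identified by the magic-number mechanism.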

The Microsoft approach for ECMA Office Open XML (officially “ECMA-376 Office Open XML File Formats” [corrected 2007-02-08T02:16Z]) is based on the Microsoft Open Packaging Conventions (OPC) defined in Part 2 of the specification.  There is an abstract packaging model that could be implemented in a variety of ways, including on servers containing humongous documents that one might very much want to use only individual parts from and also be able to access collaboratively in interesting ways.  The attention to distributed access of packages and their parts is just one of the features that I find very interesting in OPC.   To support distributed use, there is a Pack URI scheme that allows for pretty arbitrary cross-referencing within, among, and into an OPC package.   The OOX specification also specifies a physical mapping of OPC to Zip, referencing “the Zip specification.”  This mapping even allows producers to leave growth space in the Zipped parts so that some update-in-place is possible.

Appendix C of the OPC portion of the Office Open XML specification provides a comprehensive, 11-page profile of the appropriate use of Zip as a physical mapping of an OOX document.  The information is in reference to a PKware Appnote.txt document.  Unfortunately, I cannot find a proper citation of that document anywhere in my copies of the OOX specification.  It’s a mystery.  [update 2007-02-11T07:23Z I found the citation.  It is in the Part 1, Appendix A Bibliography.  The entry gives the URL of the PKware site without any indication of the version of the Appnote that is relied upon.]  There is a PKware page on the subject [AppNote], and one must trust that this is applicable for OOX.  PKware do refer to themselves as “the creator and continuing innovator of the Zip standard.”

How unfortunate that Zip format doesn’t satisfy certain high-minded criteria for qualification as an open standard! 


[ZIP] Info-ZIP Application Note 970311, ftp://ftp.uu.net/pub/archiving/zip/doc/appnote-970311-iz.zip, 1997.
     What’s beyond cool for this reference from the ODF specification (apart from it being in a Zip archive at a decidedly unofficial location), is this disclaimer: “This file is based on PKWARE's appnote.txt of 15 February 1996.  It has been unofficially corrected and extended by  Info-ZIP without explicit permission by PKWARE.”  Then there’s this: “PKWARE therefore expressly disclaims any warranty that the information contained in the associated materials relating to the subject programs and/or the format of the files created or accessed by the subject programs and/or the algorithms used by the subject programs, or any other matter, is current, correct or accurate as delivered. … Furthermore, the information relating to the subject programs and/or the file formats created or accessed by the subject programs and/or the algorithms used by the subject programs is subject to change without notice.”
      This is, of course, the hammer that some insist Microsoft will drop on OOX after everyone drinks the Kool-Aid.

[AppNote] APPNOTE.TXT - .ZIP File Format Specification, version 6.3.0, September 29, 2006.  Available at <http://www.pkware.com/documents/casestudies/APPNOTE.TXT>, accessed on 2007-02-07.
     If you don’t think this is a proprietary format, consider that the first paragraph provides this injunction: “The use of certain technological aspects disclosed in the current APPNOTE is available pursuant to the below section entitled ‘Incorporating PKWARE Proprietary Technology into Your Product’.” 
     Although there are disclaimers of the kind already mentioned in the case of [ZIP], there is this additional important statement of intent: “This specification is intended to define a cross-platform, interoperable file storage and transfer format.  Since its first publication in 1989, PKWARE has remained committed to ensuring the interoperability of the .ZIP file format through publication and maintenance of this specification.  We trust that all .ZIP compatible vendors and application developers that have adopted and benefited from this format will share and support this commitment to interoperability.” 
     It would appear that the commitment to interoperability has been quite successful. 

 
Comments:
 
Orcmid, thanks so much for this description, you have helped me begin to understand the goals, the intents, the breakdowns, and the high-minded rhetoric surrounding these issues.

I am with you 100% on the interdependence of specifications, applications, and work practices to ensure a workable and practicable open and long-lasting format for digital documents, public, private, and otherwise.

I got confused when you talked about the "hippopotamus" regarding the binary element in both ODF and OOX. I'm sure I'm missing something obvious, but what exactly is the ".ZIP" dependence that can't be avoided?

I did notice in conversations at the OpenRepositories conference last month in San Antonio that people claim complete rendering of any document format in XML without mentioning, e.g., how they manage to represent, say, a four-hour movie.

This is a very good contribution to whatever discussion I may have here in Texas regarding these kinds of legislations and regulations.
 
 
Oops! You're right.

Here's the magic connection between Zip, ODF, and OOX. Both ODF and OOX use Zip as a container for all of the XML bits. There are multiple XML parts (and other goodies, such as images), and Zip is used to package them all together. So the ODF *.odt file and the OOX *.docx file are both Zip files with different extensions.

There is actually a single-XML-file serialization in the ODF specification that I don't think anyone uses for serious documents. It takes a little work to get something like OpenOffice.org to spit one out. It is useful if you are studying the format, but not so useful as a practical interchange form.
 
 
I came across your blog by clicking a link.

orcmid said "I love the concessions to OLE objects and DDE (both Microsoft protocols) that are wired into the ODF specification. We didn’t need to specify spreadsheet formulas, but by golly we’ve got OLE covered."

Perhaps it is not obvious to you that properly storing and handling OLE and DDE is a critical part of whatever competing product must do to render a document accurately (on screen, or to a printer).

Those features are shared across all Office documents, and are of much higher priority than specifying a particular feature of just one of the applications, namely spreadsheet formulas in Excel.

And whether those formulas are specified or not is simple word quibbling. They may not be (I am not sure of that, I haven't bothered to check the final specs), but it is obvious that, in order for OpenOffice, KOffice and many other products to compete with MS Office, they have to reproduce the semantics of said formulas. Thus, Excel's run-time behavior is the specs you said are missing.

Your reasoning is bogus.

I have only taken one sentence in your post. There are MANY others just as questionable.

I don't know where you're at, but clearly a beginning would be for you to understand the difference between a product that only does read/write, versus a product that renders.

-Stephane Rodriguez
http://xlsgen.arstdesign.com
 
 
Also, if your name goes by Dennis Hamilton (apologies if I am wrong), I have left a comment for you here :

http://blogs.msdn.com/craig/archive/2007/02/07/selective-openness-and-strange-profane-rants.aspx

-Stephane Rodriguez
http://xlsgen.arstdesign.com
 
