Professor von Clueless in the Blunder Dome: 2005-10-16

Professor von Clueless in the Blunder Dome

Hangout for experimental confirmation and demonstration of software, computing, and networking. The exercises don't always work out. The professor is a bumbler and the laboratory assistant is a skanky dufus.

2005-10-17

Magical Thinking and the Universal Document Elixir

As long as we’re sitting here by the campfire telling ghost stories about the great OpenDocument vs. Microsoft Office XML FUDwrestle, it is appropriate to discuss the really great idea that the OpenDocument format is designed to be a universal file format such that, according to one commenter, “all the information in any file format should be able to be stored in ODF without loss.” It is appealing to then conclude that “this would allow it to be use[d] as the native format in many applications and, most importantly, a universal translation method between any two different formats.”

What a wonderful straw man! What a beautiful dream. A universal format that serves as a universal document model that all formats can be translated through. Douglas Engelbart will be very happy to know that this knotty problem is solved and he can get on with the OHS and other projects dear to his heart.

Microsoft: Damned If You Do, Damned If You Don’t

What’s really great about this is how it makes such a cool mouse trap to use on the folks at Microsoft. Here’s where another comment took it:

“If the MSXML binary key and software bindings do not exist, then Microsoft (and everyone else for that matter) should be able to provide the marketplace with clean clear transformation filters enabling easy conversions from MSXML to ODF and back? If they did this, then their software would meet the Massachusetts requirements. But they don't!”

Let me see, we’re supposed to assume that the magical binary key must exist because if it didn’t, there would be transformation filters between MSXML (I am not sure which XML that is, but let’s suppose it’s the existing WordML just for clarity) and ODF. But there aren’t, so the magical invisible binary key must exist? Well, maybe there aren’t because “should be able to” is actually a really hard problem?

There are two difficulties here. One problem is that the commentator is quoting Gary Edwards again and I’d really like to hear from someone else who can speak authoritatively about OpenDocument. I’d like some sense of who else is drinking the same Kool-Aid and, most of all, who is willing to provide some technical evidence for all of these weird claims. The other difficulty, and the one I really want to talk about, is the presumption of universal translatability, if that is what is really meant (e.g., easy conversions over and back).

Is There a Universal Document Format?

I have my doubts whether a universal document format is even possible. I am willing to consider that some practical level of this might be accomplished for a selected set of cases and document models that can be conformed somehow. We’ve barely gotten to that level with programming languages (thanks to the .NET CLI, actually) after a quest of almost 50 years, and programming languages are easier (unless a human has to understand the result, and then it might be harder).

So what I’m looking for is not some vague claim of a dream fulfilled but a simple demonstration of how and what level of universal transformation layer has actually been accomplished. What is the model and what was concluded about the conditions under which inter-translation works? What are/were the metrics?

How’d This Become the Terms of Debate?

The basis for this claim is that interview of Edwards (sorry) where he is reported to have said

“When the Open Document Technical Committee talks about legacy systems, we're talking about at least 30 years of legacy information systems that cross an incredible spectrum of information and file format types. Boeing is an excellent example, and ODF TC member Doug Alberg was a most important driver in the first 18 months of ODF TC work, a period I always refer to as the “universal transformation layer” period because interoperability with legacy information systems was our primary concern.”

The interview goes on to reaffirm the universal transformation layer with

“The first 18 months of the Open Document project were to perfect the Open Document XML as a transformation layer, where all of these legacy systems could be connected to the transformation layer. Once it's in the common transformation layer, then you can pick and choose which publishing and content management system you would want.”

In the cited examples of publishing and content-management systems, nothing from Microsoft is mentioned. I also don’t see mention of TeX, PDF, DocBook (or SGML generally) or a contemporaneous ISO specification, the Open Document Architecture (ODA). Since these last are well- and fully-specified, I would think they’d make great tests for successful universal transformation.

What You See Is All You Get

Besides Doug Alberg of Boeing, Edwards also gives great credit to “legendary Daniel Vogelheim” (co-architect of the OpenOffice.org XML file format and a Sun Software Engineer) for this period of the work. Vogelheim is more conservative in his stance, according to Eric van der Vlist writing in <?xmlhack?>. It seems that Vogelheim takes “transformability” to mean that the format is usable outside of the office application, something which should be pretty much true of any XML format for a document and the point of examples that Brian Jones posts about integrating/blending WordML and Excel XML formats with business applications.

The full abstract for Vogelheim’s XML2002 talk expands on this notion. It is clear that extraction and repurposing is intended. Nowhere is there any claim for universal transformation between document formats, something Edwards appears to mean and that everyone else picks up on. This also appears to be the basis for whatever logic has people believe that all Microsoft has to do is adopt the OpenDocument format.

I’m willing to believe that Edwards is serious about this when he makes comments on Bob Sutor’s blog like, “The magic transformation qualities of ODF on the other hand are legendary, and it's only five years old!” I just can’t see anywhere that this has been handled.

Show Me the Elixir

Here is where I end up with this. If there were indeed a charge to ensure some degree of universal translation with ODF as an intermediary, there is no evidence of it in the OASIS Specification. I did a search through the PDF for every occurrence of “transformation” in the document. The greatest number of occurrences have to do with transformation as used in presentation systems (such as Adobe PostScript) for transformations of drawing geometries. There are a few cases where design and feature changes are described in terms of making transformation of documents via XSLT a little easier.
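For anyone who wants to repeat that kind of search, here is a minimal sketch of a keyword-in-context scan. It assumes the specification text has already been extracted from the PDF into a plain string; the function name `keyword_in_context` and the sample sentences are made up for illustration and are not taken from the specification.

```python
import re

def keyword_in_context(text, term, width=30):
    """Return each case-insensitive occurrence of term with surrounding context."""
    hits = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse whitespace so each hit reads as a single line.
        hits.append(" ".join(text[start:end].split()))
    return hits

# Invented sample text, standing in for text extracted from the PDF.
SAMPLE = (
    "A rotation is one kind of geometric transformation. "
    "A flat structure makes XSLT transformation easier. "
    "Nothing in this sentence is about conversion."
)

hits = keyword_in_context(SAMPLE, "transformation")
for hit in hits:
    print(hit)
```

Reading each hit in context is what lets the occurrences be sorted into drawing-geometry uses, XSLT-convenience uses, and anything else.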

The key example, to my mind, is the design goal of having it be possible for any elements below the paragraph to be ignored (that is, the tags are dropped) and the remaining content be appropriate for text extraction. This is nothing like preserving formatting and document models and whatever else as part of a translation with ODF as a document lingua franca. [It also appears to capture hidden text.] Most of these features are described in terms of how they should make such transformation easier. None of them seem to be about preserving the document in going from/to ODF. I also see this principle as a barrier to the successful translation of non-ODF document architectures to ODF, when that architecture depends on sub-paragraph elements with content that is not intended to be part of the text content at all. (Whether or not that was a good idea, the question is how does one get into ODF with it.)
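To make the tag-dropping goal concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The namespace URI is the ODF text namespace; the sample paragraph, the style name `T1`, and the helper `paragraph_text` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The ODF text namespace; the paragraph below is a made-up example.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    'Plain text, <text:span text:style-name="T1">styled text</text:span>, '
    'and more plain text.'
    '</text:p>'
)

def paragraph_text(paragraph):
    """Drop every tag below the paragraph and keep only the character content."""
    return "".join(paragraph.itertext())

paragraph = ET.fromstring(SAMPLE)
extracted = paragraph_text(paragraph)
print(extracted)
```

The extracted string keeps the words but not the fact that some of them carried style `T1`, which is the difference between text extraction and a translation that preserves the document.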

Now if translation were part of the charter and charge of the Open Document Technical Committee (if you can find it let me know), and some kind of universal document model were achieved, I would expect that

It would be shouted from the rooftops by a wide community of experts.
There would be serious technical excitement about the prospect.
There would be substantial content in the specification, as well as some non-normative appendices, explaining the model, accounting for the benchmarks that were used and reporting how well they were approached.

I find nothing like that. Anywhere.

¶ posted by orcmid at 10/17/2005 12:20:00 PM

My FUD is FUDDier than your FUD, so FUD this!

I have been paying attention to the posturing that goes on around OASIS Open Document Format (ODF) and the Microsoft Office XML Reference Schemas (supported now in Office 2003 components) and the Microsoft Office Open XML (OX) that will be used as the new default format for Word, Excel, and PowerPoint in the next version of Microsoft Office.

Sometimes people I think are quite senior and knowledgeable seem to take leave of their senses in proclaiming things that years of experience in their organizations should suggest are not quite so bare-faced nor so simply accomplished.

Then there is the stuff that comes up when someone’s FUD detector has the gain too high and it goes into feedback because someone sneezed in the parking lot.

The Binary Key That Everybody Knows About

The funniest examples, if they weren’t so irritating to me, are the ones that are passed around as technical facts that “everybody knows” and quoted and referenced gleefully but never fact-checked.

One that really gets me is one that some posters keep asking Brian Jones to explain. When he does, they suggest that the truth of the matter depends on whether you believe Brian or a comment on Groklaw (no kidding), when what is in dispute is a simple, confirmable technical fact.

Here’s what I mean. Gary Edwards is one of the editors of the OASIS Open Document specification. He was interviewed by Christian Einfeldt in an article published on Mad Penguin. According to the article, Gary said this:

“.. The problem is the well-known binary key in the Microsoft's XML header of every Microsoft XML document. That binary key holds a great deal of the information that we need about the layout definitions of the Microsoft XML file format. We can do a content-based transformation very well. Microsoft's content is in perfect XML file format. Their styles, though, are locked up in that binary key. To make any kind of exchange possible with Microsoft XML documents, we have to first figure out how to cope with that binary key.”

[Update: The quote is apparently accurate. One of the other places this story is told is in a comment on Bob Sutor’s IBM blog. It seems to be Gary Edwards again. So far, I haven’t found any source for this that doesn’t end up being based on a statement credited to Gary Edwards.]

I keep asking people to show me that key that is so well-known and appears in the header of every Microsoft XML document. Just show me the binary key.

Uh, So How Come It’s Not Here?

I went looking for confirmation. My abandoned M.Sc dissertation draft is a Word 2003 document. So I saved it as XML, then opened it in FrontPage as an XML document. I used FrontPage to pretty-print it so all of the tags line up, the entities are indented, and so on. In the snippet below, I also used Notepad to add further line breaks and indentations to make the tags and elements easier to comprehend. Here’s the beginning of the file:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
                xmlns:v="urn:schemas-microsoft-com:vml"
                xmlns:w10="urn:schemas-microsoft-com:office:word"
                xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
                xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
                xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
                xmlns:o="urn:schemas-microsoft-com:office:office"
                xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
                xmlns:st1="urn:schemas-microsoft-com:office:smarttags"
                w:macrosPresent="no" w:embeddedObjPresent="no" w:ocxPresent="no"
                xml:space="preserve">
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="country-region"/>
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="City"/>
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="State"/>
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="PlaceName"/>
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="PlaceType"/>
    <o:SmartTagType o:namespaceuri="urn:schemas-microsoft-com:office:smarttags"
                    o:name="place"/>
    <o:DocumentProperties>
        <o:Title>TROST: Templates for Raising Open-Source Trustworthiness</o:Title>
        <o:Subject>M.Sc in IT dissertation submitted to The University of
                   Liverpool</o:Subject>
        <o:Author>Dennis E. Hamilton</o:Author>
        <o:LastAuthor>Dennis E. Hamilton</o:LastAuthor>
        <o:Revision>2</o:Revision>
        <o:TotalTime>0</o:TotalTime>
        <o:LastPrinted>2005-09-26T01:29:00Z</o:LastPrinted>
        <o:Created>2005-10-17T12:47:00Z</o:Created>
        <o:LastSaved>2005-10-17T12:47:00Z</o:LastSaved>
        <o:Pages>1</o:Pages>
        <o:Words>24471</o:Words>
        <o:Characters>139485</o:Characters>
        <o:Category>TROST-2005-08-05-1021-thesis</o:Category>
        <o:Manager>Gail Miles, Advisor</o:Manager>
        <o:Company>Laureate Online Education, University of Liverpool M.Sc in IT</o:Company>
        <o:Bytes>529408</o:Bytes>
        <o:Lines>1162</o:Lines>
        <o:Paragraphs>327</o:Paragraphs>
        <o:CharactersWithSpaces>163629</o:CharactersWithSpaces>
        <o:Version>11.6502</o:Version>
    </o:DocumentProperties>

and it goes on like that. There is binary content later on, in Base64 encoding. Most of it is for images that I created outside of Word and then included in the document. I gave it that binary. There is also something called <w:fldData> that is scattered throughout my document and its short content is also in what looks like Base64 encoding.
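That hunt for binary content can be repeated mechanically. The following is a rough sketch, not a definitive detector: it lists every element whose text decodes cleanly as Base64. The function name `base64_elements`, the `min_length` threshold, and the miniature sample document are all assumptions for illustration. Any plain alphanumeric word whose length is a multiple of four also decodes, so the hits are only candidates to inspect by eye.

```python
import base64
import binascii
import xml.etree.ElementTree as ET

def base64_elements(xml_text, min_length=8):
    """Return (tag, decoded_byte_count) for elements whose text decodes as Base64."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        # Remove line breaks and indentation that may be wrapped into payloads.
        text = "".join((element.text or "").split())
        if len(text) < min_length:
            continue
        try:
            decoded = base64.b64decode(text, validate=True)
        except (binascii.Error, ValueError):
            continue  # not valid Base64, so ordinary text
        found.append((element.tag, len(decoded)))
    return found

# Made-up miniature document; only the namespace URI and the fldData
# element name come from the real file discussed above.
SAMPLE = """<doc xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:title>A short title</w:title>
  <w:fldData>AAECAwQFBgc=</w:fldData>
</doc>"""

for tag, size in base64_elements(SAMPLE):
    print(tag, size)
```

Run against a full saved document, a list like this shows where every binary payload sits, which makes it easy to see whether anything of the kind lives in the header.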

Then I thought that maybe it is the use of a UUID as the URI of a namespace to be used with prefix dt:. I don’t know what that is, and I couldn’t find any actual use of the namespace so I deleted that namespace declaration. When I loaded the XML document in Word, there was no discernible difference. It doesn’t seem to matter.
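The "is this namespace actually used?" check can also be scripted. This is a minimal sketch under stated assumptions: the helper `namespace_is_used` is hypothetical, the sample is a cut-down stand-in for the real file, and the check only looks at element and attribute names. A prefix that appears only inside attribute values or text content would not be noticed.

```python
import xml.etree.ElementTree as ET

def namespace_is_used(xml_text, namespace_uri):
    """Report whether any element or attribute name belongs to namespace_uri."""
    marker = "{" + namespace_uri + "}"
    for element in ET.fromstring(xml_text).iter():
        # ElementTree expands prefixed names to "{uri}localname".
        if element.tag.startswith(marker):
            return True
        if any(name.startswith(marker) for name in element.attrib):
            return True
    return False

WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
DT_NS = "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"

# Made-up miniature of the document header shown earlier.
SAMPLE = f"""<w:wordDocument
    xmlns:w="{WORDML_NS}"
    xmlns:dt="{DT_NS}"
    w:macrosPresent="no">
  <w:body/>
</w:wordDocument>"""

print(namespace_is_used(SAMPLE, WORDML_NS))
print(namespace_is_used(SAMPLE, DT_NS))
```

A declaration that no name ever refers to is a reasonable candidate for the deletion experiment described above, with reloading the document in Word as the real confirmation.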

So, where is this binary key that is so well-known and such a terrible barrier to conversion of Microsoft XML documents to ODF? Where the FUD is it? If it’s so well known and in every Microsoft XML document, where is it?

You Mean to Tell Me Exchange Is Doing It?

I checked the Groklaw post that is supposed to be informative on the matter. It’s apparently from Gary Edwards and it doesn’t say anything about where the key is or what it is in the documents. It says something about how Exchange Server and IE6 are apparently in an act with Word involving a secret transformation to XML and back. I can’t figure out what that’s about and I marvel that this is so well-known, whatever it is. I don’t have Exchange, so there’s no way I can figure out how to test that or even care what some transient XML usage is about. When I ask for XML I don’t get any magical key. That’s all I know.

The comment then goes on to speculate about all of the evils that the existence of this key is evidence for. Then the comment goes off about an XSL/XSLT style sheet, XML2FO.xsl, that Microsoft developers came up with that apparently doesn’t work real great and this is tied back to the mystery key by arguing about whose experts are more expert. I still can’t find the mystery key.

¶ posted by orcmid at 10/17/2005 08:30:00 AM
