Welcome to Orcmid's Lair, the playground for family connections, pastimes, and scholarly vocation -- the collected professional and recreational work of Dennis E. Hamilton
OOXML-ODF: The Harmonization Hope Chest
Technorati Tags: ODF, OOXML, DIN, Fraunhofer Institute, OOXML/ODF Translator Project, open documents, Office binary formats
[Update 2008-03-06 (via Brian Jones): The DIN NIA 34 working group has raised its head above the waters. It is officially NA 043-01-34-01 VT Working Group Translation 29500-26300. They have published a schedule and the first working-paper draft is available for download. There is a terrific little graphic about the cases they want to understand.]
[Update 2008-02-16: Microsoft has delivered on its promise to ECMA TC45, originally reported by Brian Jones on January 16. The specifications for seven binary formats are placed under the Open Specification Promise and freely available for download. The SourceForge project, b2xTranslator, has been established and its initial program of work has been posted. I am hopeful that this work and the conclusion of the DIS 29500 Ballot Resolution meeting will energize the DIN NIA 34 working group on OOXML-ODF translatability.]
This post has four points:
Brian Jones has reported that there is a great deal of interest in harmonization of ODF and OOXML in some way. Subsequently, Brian reported on the ECMA response to ISO/IEC JTC1 DIS 29500 (OOXML) ballot comments requesting harmonization, quoting from the ECMA response:
I agree, this is the best way we know to learn where the stumbling points are and what the prospects are for reconciling the document models. I also appreciate that the ECMA response recognizes that current translation capabilities could be better with improved understanding of the formats, but no claim to perfect translatability is made.
Cautious statements about the prospects for translation are also welcome. With regard to present capabilities, Jones's quote of the ECMA response contains this earlier passage:
where that's my emphasis on "useful."
I maintain that getting from "useful" through "even better" to "assured" is unlikely.
I think it is reasonably-well understood that the architectures of OOXML and ODF documents are extremely different. There is great impact on how the format architectures are intended to be processed and how they map to tangible renderings of the document content. These details might be found to have common conceptual abstractions within which translation is reasonably-direct rewriting from one digital format to the other. There is reason for skepticism. Harmonization of terminology and metadata provides some important considerations:
I am skeptical that office-document formats provide a relatively well-constrained domain. Our optimism must be tempered by caution.
I do agree that we need some concrete, specific undertaking to calibrate what kind of harmonization is possible as a practical matter:
It will take serious effort to delineate these cases, and at the moment attention is on the Translation Working Group under DIN NIA-34 [5, 6]. Update: I am heartened to learn that there has been very serious analysis by public agencies of the Danish government. Update Update: I have also unblocked my own thinking about an approach that might provide more concrete visibility on behalf of harmonization.
Meanwhile, Sam Hiser offers a different impression of the DIN effort:
Although the mission of the German effort is translation (Übersetzung), not harmonization, I find there is a very important point that is not made often enough:
I think that is a big deal. It is what people deal with. The formats are below everyone's direct attention and, while standardization and assurance of interchange is essential, it is insufficient, I think, with regard to what works for people who use products that "support" the formats. If interchange is the game, something else is required.
4.1 Living in a World of Appearances
Apart from the tab, space, enter, and back-tab keys, my personal control of layout is confined to an amazingly-prevalent set of toolbar icons:
For each of the HTML editors, I often edit the file format for desired effect. Usually the editors leave my markup alone enough for the result to be achieved. I am reminded of performing similar contortions with Borland Sprint, a document editor that employed a plaintext markup scheme.
MediaWiki editing is also a form of direct markup, although my control over the HTML that is served up is indirect and mastered by trial-and-error post-and-revise exercises.
The important observation is that I use these features for the appearance that is produced and the tools and the symbols encourage me to do exactly that. I need not make any semantic distinctions while using the tools to achieve the formatting that serves my purposes. (I am disgruntled that some products replace indent and outdent with a “” toggle. I don't care that the element involved is tagged <blockquote> and don't get fancy with me.)
4.2 For Users, It is the Computer Programs and What Happens
People ask questions about software products and getting their job done, not formats. On the Massachusetts Information Technology Division site, all of the practical questions are for working between Microsoft Office and OpenOffice.org programs, not how to use the formats:
That is a great testament to the adaptability and resourcefulness of people, something that we are unable to endow in translators of any kind at this time. It is also not clear how this interactive dance fits into an interchange regimen, even among users of the same software product (release).
This passage shows how far the concerns in Massachusetts are from matters that might be governed by the respective formats:
Then there is this wonderful alibi about conversion failures:
We'll see, won't we, especially since the legacy file formats technical specifications have been available for some time and will soon be easier to obtain.
Update: I think there may be ways to offer products that are specifically geared to only using harmonious features of open-standard document formats. This would not require users to know or care about the formats and the ways that their documents comply, and the usual ad hoc behavior would work (so long as one does not have frustrating expectations of feature parity with all products).
This is a very difficult article to write. I thought it would be a quick summary of the situation, with some useful links to the translator projects in NIA-34. It is very disappointing to find so little public evidence of useful activity under NIA-34 and the Fraunhofer Institute, with the only visible progress still that of the OpenXML/ODF Translator Add-In SourceForge project.
[update 2008-02-08T10:35 -0800: I have finally expressed some pent-up ideas about technical approaches to practical harmonization (and determining if there is useful feasibility). I've updated this post to cross-reference to that emerging material. I'm also enjoying an exchange in the comments, though I want to bring any rejoinders of mine back to the thrust of this article, which does not seem to be fully understood (whether or not agreed with).]
This is one of the best posts I have ever read that describes the gap between human activity and work, and what that has to do with representations of information in computer systems. (I'll tip my own hand here and say that the relationship between the two, while important, is tenuous at best.)
Good piece of exposition, orcmid.
I think that this posting misses a few critical things:
(1) ODF isn't limited to OpenOffice. If OpenOffice isn't good enough, another ODF application (such as ones from IBM or Corel) should be. If the next version of Office doesn't have a feature or gratuitously increases your training costs via a radical GUI change (as Microsoft has) or arbitrarily decides to drop legacy file format support (as Microsoft has), you're out of luck.
(2) Microsoft has a poor history of displaying and saving files from one version of Office in another. There's no reason to believe this will change, so people are *already* dealing with this. ODF at least provides some hope of escape since the format is set in stone.
(3) The OOXML spec isn't implemented anywhere, even by Microsoft, and there's no guarantee that it will implement the official spec. There's ample history that shows that they'll embrace and extend any spec they support. There's very little change with the DOC format situation.
(4) There's zero reason to mass migrate all legacy DOC files. DOC is a de facto standard that's been reverse engineered to death and has several apps that are able to convert this format into a modern format as the document is needed with high accuracy. If an exact visual match is needed, export to PDF is the best option.
(5) People don't care about formats until it bites them where it hurts. Anyone who has a drawer full of 8 track tapes or 5.25 inch floppies or Amiga OS 3.5 inch floppies knows that having the data is pointless unless you can access it properly. Anyone with a MS Word 2.0 document knows how badly even MS Office 2000 mangles it. Word 2.0 isn't that long ago, especially in government where many docs need to be kept for over 20 years.
(6) If your data is transitory and you don't care about storing it, then it doesn't matter what format or application you use. Standardizing transient information reduces flexibility within different departments. The only thing that makes sense to standardize this is to avoid vendor lock-in, but even in this case, standardizing any multi-vendor format is good enough.
In short, I don't see what point you're trying to make.
OK, I'll bite:
(1) I haven't said that. I have only indicated that at user level it is not the format that people deal with or express much control over (apart from the general choice), it is the software product, and that applies to all of the software products.
(2) I'm not so sure how poor Microsoft's history is, and we should probably calibrate that somehow. However, ODF is *not* set in stone and it is disingenuous to claim so.
(3) I hear this claim a lot. What is the basis for that? Where is it that new documents in Word 2007 when saved are not in OOXML? Documentation please.
(4) I don't think I raised this at all. Where do you see me promoting or even considering mass migration?
(5) You are making my point about what people care about and what is at their level of attention.
(6) I think the summary at the top of the post says all that I intend. I think it is material when one considers interoperating using products of different vendors. I really was not addressing preservation, but I certainly agree that known standards (de facto or public) matter for that.
Here's my followup:
(1) The point I'm making is that I agree, but that because you're not locked in, you have choice on UI and function, which can't happen with OOXML (which is controlled by one vendor with self interest in vendor lock in). HTML-like formatting is perfectly good for email and it's all that many people need, so why should these users be burdened with something as large and complex as MS Office or OpenOffice? At the opposite end of the spectrum, there are highly technical desktop publishers who need a high degree of control and a lot of the "cute user-friendly" features of MS Office or OpenOffice get in the way. Each needs a different interface and different application, but both can exchange documents if you have standardized the document format.
(2a) Okay. Calibrate that against paper. I have a stack of newspapers from 1942 that I found between the floors. Although it's brown, sooty and fragile because it wasn't taken care of, it's still very readable. It paints a very different picture of the time than what's portrayed in the media. Or if you want a software calibration, ASCII text. I have some text documents from my Commodore 64 with my school projects that are still readable. I also have several books with TROFF formatting (created in the 1960s) that still work today on any modern Unix machine. Ditto for LaTeX. Any good document format needs to be at least that good at keeping legacy.
(2b) ODF 1.0 *is* set in stone and so are ODF 1.1, etc. And all are backwards compatible since so many parties involved have a vested interest in not rewriting their tools. That's the whole purpose of standardization.
(3) Here's a reference for Microsoft Office.
Interesting. OK, let's continue this a bit.
(1) I don't see how OOXML, the format, has anything to do with UI lock-in. I think the people at MindJet would be surprised. I don't want to speculate so far out to whether there is lock-in. I think you and I are reading different things into my comments about what users do, which applies to any UI and how users train themselves to a UI to get the results they want. I suppose in one sense we are agreeing on the phenomenon, but not on what OOXML has to do with it.
(2a) I meant calibration on the extent to which Microsoft has broken its own formats in their successors. I'm with you on paper, and other enduring formats (text, etc.), although some of those have incompatibilities, whether code-page dependencies in text or e-mail-format difficulties.
(2b) Considering that ODF is still incomplete/underspecified and the conformance requirements are next-to-nil, how can you say ODF is set in stone? I just don't follow that and then ...
(3) Your reference didn't come through. I'd like to see it.
(3b) Also, I should have asked this differently. Name one ODF-supporting processor that implements the ODF spec and only the ODF spec, with no un[der]-specified in ODF content. That means no use of namespaces not supported in the ODF specification (and no abuse of ones that are).
(7) So let's come back to the prospects for harmonization. What are your views on that: possible? not possible? irrelevant? dangerous? what?
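The namespace test proposed in (3b) can be made concrete. What follows is only a rough sketch in Python, not a conformance checker: it lists the namespaces that a document's elements and attributes actually use and reports any that fall outside an allowed set. The two ODF namespace URIs are real, but a genuine check would need the full list from the specification, and the vendor namespace and the `layout-hint` attribute are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Two namespace URIs defined by ODF 1.x; a real check would list all of them.
ODF_NAMESPACES = {
    "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

# Hypothetical vendor namespace, invented for this illustration.
VENDOR_NS = "http://example.com/hypothetical-vendor"

SAMPLE = (
    '<office:document-content'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"'
    f' xmlns:vendor="{VENDOR_NS}">'
    '<office:body><office:text>'
    '<text:p vendor:layout-hint="x">Hello</text:p>'
    '</office:text></office:body>'
    '</office:document-content>'
)


def foreign_namespaces(xml_text, allowed):
    """Return namespaces used by elements or attributes but absent from `allowed`."""
    used = set()
    for element in ET.fromstring(xml_text).iter():
        # ElementTree writes qualified names as "{namespace-uri}local-name".
        for name in [element.tag, *element.attrib]:
            if name.startswith("{"):
                used.add(name[1:].split("}", 1)[0])
    return used - allowed


found = foreign_namespaces(SAMPLE, ODF_NAMESPACES)
print(sorted(found))
```

This covers only the first half of the challenge. It reports foreign namespaces, and says nothing about whether names inside the ODF namespaces are being used as the specification intends.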
Not possible unless dangerous. This is not something Microsoft would want, so they won't co-operate, and their sincere co-operation would be absolutely necessary because OOXML is, despite all the pages and pages, seriously underspecified and anyway a "harmonized" format wouldn't fly if Microsoft didn't adopt it.
But there is one condition under which Microsoft *would* want it: If they controlled it. In which case it would just be a way of killing ODF. They'd end up with a "harmonized" format for all the different word processors to use--that only Microsoft Word could use.
Here's the link that got stripped out:
Good heavens, is that all? This is pretty trivial. If it is the only example it is mildly ridiculous.
I agree that it is a bug, and easy to see how it got there. (I bet MSFT didn't have a test case for this until it came up in the comment about the Translator project.)
I figured there was a problem with Microsoft using more than what was in the spec, but this is a bug in the implementation of an initially-uncommon feature in the spec. It is a bug, though.
This is in the section of OOXML on versioning of the format itself where a new attribute introduced by someone making an extension can be tagged to be ignored, preserved, or deleted by an OOXML processor that doesn't understand the extension.
They were not honoring the instruction to preserve them. I wonder if this was fixed in Office 2007 Service Pack 1? I'll see if I can find the test file and check.
I'll also see if the spec. is solid around what should happen if the content of the element that has such an attribute is edited.
Oops! Not a bug.
OK, I looked into ECMA-376, Part 5, clause 9.1.3, and it is not a bug. The relevant bits are on p.17, lines 5-23 just before clause 18.104.22.168:
"Even in the presence of explicit preservation guidance in a markup specification, any markup editor might choose to discard together all ignored markup without regard to the presence of any PreserveElements or PreserveAttributes attribute. ... [M]arkup consumers shall always accept, but possibly disregard PreserveElements and PreserveAttributes attributes on any element."
It turns out that the PreserveElements and PreserveAttributes stipulations are hints, and it is not mandatory to honor them.
So technically, as much as I don't like it myself, the behavior of Microsoft Office Word 2007 is acceptable either way. There are other nuances, but this is the bottom line of what ECMA-376 says on the topic.
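To make the "hint" point concrete, here is a minimal sketch in Python of an editor that drops attributes from namespaces it does not understand, with a switch for whether it honors a preservation request. It is a simplification under stated assumptions, not how Word or any real OOXML processor works: the extension namespace and the `note` and `temp` attributes are invented, and the preserve list is passed in directly rather than parsed from a PreserveAttributes attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace, invented for this illustration.
EXT_NS = "http://example.com/hypothetical-extension"

SAMPLE = (
    f'<doc xmlns:ext="{EXT_NS}">'
    '<p ext:note="keep me" ext:temp="drop me">Hello</p>'
    '</doc>'
)


def namespace_of(name):
    """Return the namespace URI of an ElementTree name in {uri}local form."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else ""


def drop_ignored_attributes(root, understood, preserve, honor_hints):
    """Remove attributes from namespaces the editor does not understand.

    `preserve` holds the attribute names the document asks editors to keep.
    With honor_hints=False the request is disregarded, which the clause
    quoted above permits.
    """
    for element in root.iter():
        for name in list(element.attrib):
            ns = namespace_of(name)
            if ns == "" or ns in understood:
                continue  # un-namespaced or understood attributes stay
            if honor_hints and name in preserve:
                continue  # the editor chooses to honor the hint
            del element.attrib[name]
    return root


preserve = {f"{{{EXT_NS}}}note"}
kept = drop_ignored_attributes(ET.fromstring(SAMPLE), set(), preserve, True)
dropped = drop_ignored_attributes(ET.fromstring(SAMPLE), set(), preserve, False)
print(kept.find("p").attrib)     # only ext:note survives
print(dropped.find("p").attrib)  # nothing from the extension survives
```

Both results are acceptable under the reading above, which is exactly why an extension author cannot count on round-tripping through an editor that does not understand the extension.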
I have created a blog and site that is specifically about harmonization of office-productivity format usage to an interoperable level. It is nfoWorks: Pursuing Harmony.
I am going to replicate this post there as part of the history of the project. Unfortunately, the comment thread won't travel with them. I must find something creative to do about that.
It was certainly interesting for me to read the post. Thank you for it. I like such themes and anything that is connected to this matter. BTW, why don't you change design :).