2004-12-05

Failure Is Your Friend

ACM Queue - A Conversation with Bruce Lindsay: Designing for failure may be the key to success. Although the title maxim has many implications, here I want it to draw you to a comprehensive and breezy little article that shows how much is gained by having a healthy respect for failure. IBM Fellow Bruce Lindsay walks through the architectural principles around the detection and handling of failure by software and how, in general terms, reliable systems are built from unreliable components.

One key point that Bruce underscores is that you don't want to go hunting for failures with a modified or adjusted version of the system (no debugging version versus production version, to be harsh about it). There should be only one version, and it is the same one that you troubleshoot and run production with. That leaves a new problem: you need to be able to inject failures in some benign way, so that you can always confirm that the recovery and mitigation processes work, that those paths operate, and that operational personnel are familiar with seeing them, if they are visible at all. I think of this as a software fire drill, and I was led to it by two experiences.

The first experience was when we were building our first (nearly) IBM plug-compatible byte-oriented computers at Sperry Univac. We were also going to be shipping our own disk drives, but on the development floor we had IBM disk drives on the prototypes so we could get the operating system built. Things were going swimmingly until we received the first Univac drives from the factory. These early-production drives were not as reliable as the IBM ones, and suddenly the operating system was crashing a lot. The error-recovery software in the disk-handling bowels of the operating system had never been checked and live tested; the fault-handling paths had not been thoroughly exercised, and we saw the worst possible Maytag repairman problem. There may have been disk-error-related crashes before, but they may have gone unnoticed among the other surface issues that were being worked out and debugged at the time.

The idea of wiring in software fire drills and actually injecting reversible errors into the stack of operations is something I applied in the early '70s, when I was writing some multi-threaded terminal-cluster control software atop an operating system designed for real-time applications. The fact that I knew there were going to be errors, and that I had to make sure I could reverse them, had a tremendous impact on the design approach. It worked so well that I never had to actually inject errors: the hardware in the terminal controllers had race conditions that gave me ample real-life failures to be assured that the recovery software was exercised early and often.

The only problem was an over-simplified channel coupler between the computer and the terminal controller, with no way for the computer to programmatically reset the controller once it stopped responding. Everything in my computer-side control software failed soft, but all work stopped until the administrator powered the controller down and back up. It was not a pretty sight. The software was more reliable than the hardware, and some false economies in the hardware left us with no way to do anything about it. This is part of the story at the bottom of the page here.
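To make the fire-drill idea a little more concrete, here is a minimal sketch, in Python rather than anything from that era, of a single write path that carries its own reversible fault injection. The DiskWriter class, its device object, and the fault_rate knob are hypothetical names for illustration, not anything from the systems described above; the point is only that the drill and real device errors go down the same recovery path in the same build.

import random

class InjectedFault(Exception):
    """A benign, deliberately injected failure used for fire drills."""

class DiskWriter:
    """Hypothetical disk-write wrapper: one code path for production and drills.

    Setting fault_rate above zero turns on the fire drill; the recovery
    below is the same path that handles genuine device errors, so it
    stays exercised and visible to operations.
    """

    def __init__(self, device, fault_rate=0.0, max_retries=3):
        self.device = device          # assumed to provide .write(block, data) and .reset()
        self.fault_rate = fault_rate  # e.g. a tiny nonzero value during drills
        self.max_retries = max_retries

    def write(self, block, data):
        for attempt in range(1, self.max_retries + 1):
            try:
                if random.random() < self.fault_rate:
                    # Reversible by construction: nothing has been written yet.
                    raise InjectedFault(f"fire drill on block {block}")
                self.device.write(block, data)
                return
            except (InjectedFault, IOError) as err:
                # Recovery path: report, reset the device, and retry.
                print(f"write attempt {attempt} failed: {err}; recovering")
                self.device.reset()
        raise IOError(f"block {block}: write failed after {self.max_retries} attempts")

In this sketch the production system would run with a small nonzero fault_rate, so the recovery path gets walked regularly by the very same code that serves real work, rather than only in a special debugging build.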