Writings W040400:
Ariane 501: A Costly Risk Myth

0.10 2014-02-03 18:01 -0800


In spring 2003, I participated in an on-line Software Engineering class using Ian Sommerville's Software Engineering (2001).  In the eighth and last week of the module, we were assigned a discussion question about liability for the Ariane Flight 501 failure.  In my research, I discovered that the actual cost and loss had been exaggerated.  More important than the tendency to exaggerate, the question took as given that a programming error was the source of the failure.  Since then I have encountered numerous retellings of the programming-error myth, and I have kept hesitating to document my findings.  I think I didn't know how to write this in a way that would not embarrass the designer of the course, who was also our instructor and, later on, my dissertation advisor.  But I can't fault the instructor's perpetuation of a myth that had already been seized on by a very diverse group of very bright people and promoted as the received wisdom on the matter.  So I shall steel myself for another try at exorcising the Ariane 501 programming-error myth.

Summary

It is simply not the case that a programming error involving a numeric conversion (some call it an overflow) was the root cause of the failure of Ariane 5 Flight 501.  That's not what happened.

That's right.  There was no bug.  There was no coding or software-implementation error.  The software did what it was designed and agreed to do. 

The programming-error story has so strongly taken its place in the software technocracy's popular wisdom that the most important and actionable lessons are completely obscured.

There was indeed a chain of events that doomed the flight: an out-of-range data condition in a calculation that wasn't even needed, the by-design throwing of an uncaught exception, and the automatic shutdown of the launch vehicle's active and backup inertial reference systems.  As the result of the unanticipated failure mode and a diagnostic message erroneously treated as data, the guidance system ordered violent attitude correction.  The ensuing disintegration of the over-stressed vehicle triggered the pyrotechnic destruction of the launcher and its payload.
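
The chain of events above can be made concrete with a small simulation.  This is a minimal sketch in Python, not the flight software (which was written in Ada): the class and function names, the diagnostic value, and the sample inputs are all assumptions made for illustration, and the timing between the two units is collapsed into a single cycle.

```python
# Illustrative sketch only; every name and value here is hypothetical.

INT16_MIN, INT16_MAX = -32768, 32767
DIAGNOSTIC_WORD = -1  # assumed placeholder for a diagnostic bit pattern


class OperandError(Exception):
    """Stands in for the exception raised by the unprotected conversion."""


def to_int16(value):
    """Convert to a 16-bit signed integer, raising when the value does not fit."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return result


class InertialReferenceUnit:
    """One of two units running identical software."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def step(self, alignment_value):
        """Return ("data", n) while healthy, ("diagnostic", word) once shut down."""
        if self.running:
            try:
                return ("data", to_int16(alignment_value))
            except OperandError:
                # By design, the exception is taken to mean an equipment
                # fault, so the unit shuts itself down.
                self.running = False
        return ("diagnostic", DIAGNOSTIC_WORD)


def flight_cycle(active, backup, alignment_value):
    """Feed the same input to both units and pick a word for guidance."""
    outputs = [unit.step(alignment_value) for unit in (active, backup)]
    for kind, word in outputs:
        if kind == "data":
            return word
    # Neither unit is running: the diagnostic word is consumed as if it
    # were flight data.
    return outputs[0][1]


active = InertialReferenceUnit("active")
backup = InertialReferenceUnit("backup")
in_range = flight_cycle(active, backup, 1200.0)      # within the required range
out_of_range = flight_cycle(active, backup, 64000.0)  # outside it
```

Two copies of the same software given the same input offer no protection against a condition the design treats as fatal.  In the sketch both units shut down on the same cycle, and the guidance side is left reading a diagnostic word as data.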

This is not about a programming error.  It is about system-engineering and design failures.  There was not even a finding of liability, and that's important to understand as well.

How the Ariane developers regarded software defects and subsystem failures is more important.  It is also important to appreciate that the particular software was being used under conditions it was not designed for, and at a time when it wasn't even required.  These were not matters to be solved by improved debugging methods, attention to numerical-computation edge cases, using a programming language other than Ada, or changing programmer-level development methodology.  All of those, as desirable as they might be, would have been insufficient to save Ariane Flight 501.

The 1996 Board of Inquiry nailed the important lessons and the appropriate remedies (ESA 1996). 

It becomes interesting to consider why we are so satisfied with the myth instead of the reality:

  1. What actually happened?
  2. What did we make it mean?
  3. How do we get the most-valuable lesson?
  4. Why do we find the myth so much more preferable?

-- Dennis E. Hamilton
Seattle, Washington
2005 December 31


Bibliography

Arnold, Douglas N. (2000).
The Explosion of the Ariane 5.  Education Related Materials pages,  Institute for Mathematics and Its Applications, University of Minnesota, Minneapolis, 2000 August 23.   Available at <http://www.ima.umn.edu/~arnold/disasters/ariane.html> (accessed 2003-05-12, 2005-12-29).
     The Ariane 501 flight failure description is found in a collection of "disasters attributable to bad numerics" where the explosion is described as "ultimately the consequence of a simple overflow."  There is in fact no such finding in the Board of Inquiry Report (Lions et al. 1996).  Arnold's three quotations are from different parts of the report and in a different sequence.  I was led to this version of the myth by its citation in my class-discussion assignment.
     Counseling about the pitfalls of computer arithmetic and the dangers of "down-casting" overflows is important.  Unfortunately for the appealing case of the Ariane 501 flight, the "simple overflow" occurred with data that was far outside the range that the software was required to accept.  It was an approved design decision to regard such extremity as an indication of a system fault and to allow an uncaught exception (and subsystem shutdown) if the situation ever occurred.  To the extent that the condition did not arise, this analysis held up for all 113 Ariane 4 missions, from June 1988 until February 2003.
     Arnold has provided valuable scholarship support through preservation of an HTML version of the Board of Inquiry report, without the presentation slides and diagrams, at <http://www.ima.umn.edu/~arnold/disasters/ariane5rep.html>.
           
ESA (1996).
Ariane 501 - Presentation of Inquiry Board report.  ESA Press Release No. 33-1996 (English).  European Space Agency, Paris, 1996 July 23.  Available at <http://www.esa.int/esaCP/Pr_33_1996_p_EN.html> (accessed 2004-04-06, 2005-12-31).
     The PDF version of (Lions et al. 1996) is linked from this page.
    
ESA (2004).

Lions, Jacques-Louis (Chairman), et al. (1996).
Ariane 5 Flight 501 Failure.  Report by the Inquiry Board,  Paris, 1996 July 19 (English).  PDF version available at <http://ravel.esrin.esa.it/docs/esa-x-1819eng.pdf>.
    The Inquiry Board makes the following statements in different parts of the report:
     (a) "On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded." -- quoted from the Foreword
     (b) "The internal SRI software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer." -- quoted from section 2.1, Chain of Technical Events.  That passage continues: "This resulted in an Operand Error. The data conversion instructions (in Ada code) were not protected from causing an Operand Error, ... ."  We would now say that the situation caused an uncaught exception to be thrown.  The SRI was designed to treat that kind of exception as fatal. The very next passage identifies the problem with the exception being thrown where and when it occurred after lift-off: "The error occurred in a part of the software that only performs alignment of the strap-down inertial platform. This software module computes meaningful results only before lift-off. As soon as the launcher lifts off, this function serves no purpose."
     (c) "The failure of the Ariane 501 was caused by the complete loss of guidance and attitude information 37 seconds after start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to specification and design errors in the software of the inertial reference system." -- quoted from section 3.2, Cause of the Failure.  This short section has only this to add: "The extensive reviews and tests carried out during the Ariane 5 Development Programme did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure."  It is not numerical analysis that is meant here.  The reference is to analysis of the change of requirements between Ariane 4 and Ariane 5, as well as the consequences of having the alignment subsystem continue to run after lift-off.  There is also consideration, at a higher level, of the consequences of allowing shutdown in a critical subsystem and of assuming that the software is trustworthy, with exceptions always attributable to equipment failure.
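
     To make "not protected" in quotation (b) concrete, here is a small sketch in Python of the same kind of narrowing conversion, once unguarded and once guarded.  It is an illustration under assumptions, not a rendering of the Ada code: the function names are hypothetical, and clamping is only one possible form of protection, not one the report prescribes.

```python
import struct

# Range of a 16-bit signed integer: -32768 .. 32767
INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1


def convert_unprotected(value):
    """Narrow a float to 16 bits with no guard.

    struct.pack raises struct.error when the integer does not fit the
    "h" (signed short) format, which plays the role of the Operand Error.
    """
    packed = struct.pack("<h", int(value))
    return struct.unpack("<h", packed)[0]


def convert_protected(value):
    """Narrow a float to 16 bits, clamping to the representable range first."""
    clamped = max(INT16_MIN, min(INT16_MAX, int(value)))
    return struct.unpack("<h", struct.pack("<h", clamped))[0]
```

     A value such as 1234.9 converts to 1234 either way.  A value such as 1e6 makes the unguarded version raise, while the guarded version returns 32767.  As the annotations above note, leaving the conversion unguarded was an approved design decision for the range of values the software was required to accept, so the guard alone does not capture the Board's findings.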
   
Sommerville, Ian (2001).
Sommerville, Ian.  Software Engineering, ed.6.  Addison-Wesley (Boston: 2001).  ISBN 0-201-39815-X.
    "It is not uncommon for verification and validation to take up more than 50 percent of the total development costs for critical-software systems.  This cost is, of course, justified if an expensive system failure is avoided.  For example, in 1996 a mission-critical software system on the Ariane 5 rocket failed and several satellites were destroyed.  The consequential loss was hundreds of millions of dollars." section 2.1, Critical system validation, p.468.
    As a nutshell summary, this is one of the most-precise statements that I have found.   The Inertial Reference System clearly failed and there is great emphasis on system validation in the remedies.
  

0.10 2005-12-31-18:06 Refactor as Part of Writings/2004/04/W040400 Organization
The original front page is moved to serve as the second version of my account and the present page is updated to become a folio cover with essentially bibliographic and table-of-contents functions.
0.00 2004-04-05-18:21 Create Basic Article Structure
Bill Anderson called and pointed out one more place where the folklore about the Ariane 501 failure as a software problem has come up once again.  We want to have something somewhere to discuss the strangeness of that persistent claim when the failure was quite different than that.  I begin by gathering the notes I have, and sketching the approach.
   


created 2004-04-05-18:21 -0700 (pdt) by orcmid