Crash and Burn — a short story of Ariane 5 Flight 501

Written by hackernoon-archives | Published 2018/06/03
Tech Story Tags: software-development | tech-history | quality-assurance | project-management | ariane-5-flight-501


The 22nd anniversary of Ariane 5 Flight 501 offers an opportunity to reflect upon software defects, project errors, and the best principles and practices for solution delivery in the IT industry. In this blog and my upcoming book, Bugs: A Short History of Software Imperfection, I will chronicle important failures of the past, explain how we arrived at the present, and discuss ideas for improving the future of software quality. As information technology becomes increasingly woven into everyday life, the quality of software affects our commerce, health, infrastructure, military, politics, science, security, and transportation. The big idea is simple: we have no choice but to get better at delivering technology solutions.

On June 4, 1996, in Kourou, French Guiana, the maiden flight of the Ariane 5 rocket (Flight 501) ended almost as soon as it began. About 37 seconds after the start of the launch sequence (30 seconds after liftoff), at an altitude of roughly 4,000 meters, the rocket veered sharply off its intended flight path due to a software failure. The resulting aerodynamic stress tore its boosters from the main stage, triggering a self-destruct sequence that culminated in the vehicle exploding in a fireball of its liquid-hydrogen propellant.

The European Space Agency (ESA) had ambitions to take a leadership role in the commercial space business and surpass Japan, Russia, and the USA. The Ariane 4 (A4), in service since 1988, boasted an excellent record of reliable launches. The new Ariane 5 (A5) would carry larger satellite payloads than earlier vehicles, and Flight 501 carried four Cluster satellites intended for research on the Earth’s magnetosphere. ESA had spent 10 years and $7 billion developing the A5, and Flight 501 itself was valued at $370 million. The A4’s success and ESA budget pressures led the A5 program team to reuse A4 software, including its navigation system and flight-path optimization libraries.

ESA convened an inquiry board immediately after the crash to investigate the disaster. Using flight data, optical observations (infrared camera and film), inspection of recovered material, and review of the software code, the board reconstructed the following sequence of events.

  • At T+37 seconds, the Inertial Reference System (SRI) that measured the rocket’s attitude and movement sent incorrect data to the Flight Control System instead of actual flight data. An arithmetic overflow had occurred inside the SRI’s alignment function when a 64-bit floating-point value for the horizontal bias variable (BH) could not be converted to a signed 16-bit integer. The SRI in the A5 had been reused as a black box from the A4. The BH value was larger than expected because the early part of the A5 trajectory differed from the A4’s, producing horizontal velocities roughly five times higher. The error did not occur earlier in the flight because the vehicle’s speed was lower and the calculated values still fit the program’s data types. Since the Ada code for the alignment function had no exception handling, the operand error propagated upward: the SRI entered a failed state and returned a diagnostic value intended for debugging purposes only.
  • The backup SRI, identical in hardware and software to the active unit, could not take over because it had already failed for the same reason.
  • At T+38 seconds, the On-Board Computer that executed the flight program, acting on the incorrect SRI data, commanded course corrections that deflected the rocket’s nozzles to an extreme angle and abruptly changed the flight path.
  • At T+39 seconds, high aerodynamic forces separated the boosters from the main stage and triggered the self-destruct subsystem.
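The core defect can be sketched in a few lines. Here is a minimal, hypothetical Python illustration (not the actual Ada code) of what an unchecked 64-bit float to signed 16-bit integer conversion does to an out-of-range value, next to a guarded variant; the function names and sample values are my own:

```python
INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_unchecked(value: float) -> int:
    """Mimic a raw 64-bit float -> signed 16-bit conversion with no range
    check; out-of-range inputs silently wrap (two's-complement truncation)."""
    n = int(value) & 0xFFFF          # keep only the low 16 bits
    return n - 0x10000 if n >= 0x8000 else n

def to_int16_checked(value: float) -> int:
    """Guarded variant: raise instead of silently corrupting the value."""
    if not (INT16_MIN <= value <= INT16_MAX):
        raise OverflowError(f"value {value} exceeds signed 16-bit range")
    return int(value)

# A4-era horizontal-velocity values fit the type; A5 values were roughly
# five times larger, pushing BH past the 16-bit limit.
print(to_int16_unchecked(30000.0))   # fits: 30000
print(to_int16_unchecked(40000.0))   # wraps to -25536: silent sign corruption
```

Note that the unchecked version produces a plausible-looking but meaningless number, which is exactly why the corrupted BH value was consumed downstream as if it were real flight data.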

Ariane 5 Flight 501 @ T+ 39 seconds

The inquiry board further analyzed the SRI software and overall A5 program and arrived at several conclusions:

  • The SRI alignment function performed ground-based alignment of the inertial platform prior to lift-off (until about T-3 seconds); once the rocket took off, it served no purpose. It was nevertheless left running in the A5 for the first 50 seconds as a “special feature,” in case the countdown was briefly held just before lift-off and the system needed to be restarted; on the A4, such resets could otherwise take hours, and this shortcut sped up the process.
  • The SRI code had been analyzed for exceptions by the A4 team, and seven variables were deemed at risk of operand error. Because a maximum SRI CPU utilization target of 80% had been set for the A4, only four variables were protected; three were left unprotected, including BH. The original justification was that these variables “were physically limited or there was a large margin of safety.” However, the CPU constraint applied only to the A4, not the A5, and the SRI code was never re-analyzed by the A5 team using realistic A5 input data.
  • The operand error alone was not sufficient for system failure; the specification and design of the SRI exception-handling mechanism also contributed. The system specification stated that in the event of any exception, the failure should be indicated on the data bus, the failure context should be stored in EEPROM memory, and the SRI processor should be shut down. This drastic response reflected an engineering culture within the Ariane program that focused on hardware failures rather than software failures, since the former occurred more often. Shutting down active systems and switching to the backup is a rational approach to random hardware failures. For this software fault, however, a better approach would have been to provide best-effort estimates of the required attitude, position, and velocity. As the inquiry board put it, “software should be assumed faulty until applying the currently accepted best practice methods can demonstrate that it is correct.”
  • While some unit testing and integration testing had been done with A4 data, no end-to-end integration test of hardware plus software, and no simulation with realistic A5 trajectory data, was ever performed. Post-crash flight simulations running the SRI software against the actual 501 trajectory data reproduced the chain of events leading to the SRI failure.
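The contrast the board drew, fail-stop with a diagnostic value versus graceful degradation, can be sketched in Python. Everything here (the Attitude type, compute_attitude, the two policy functions) is hypothetical and only illustrates the design choice, not the real SRI interfaces:

```python
import math
from dataclasses import dataclass

@dataclass
class Attitude:
    pitch: float
    yaw: float
    roll: float

# Stand-in for the diagnostic bit pattern, meaningful only to a debugger.
DIAGNOSTIC_PATTERN = Attitude(math.nan, math.nan, math.nan)

def compute_attitude(raw: float) -> Attitude:
    """Toy stand-in for the real alignment math; rejects out-of-range input."""
    if abs(raw) > 32767:
        raise OverflowError("input exceeds 16-bit operand range")
    return Attitude(raw * 0.001, raw * 0.002, 0.0)

def shutdown_policy(raw: float) -> Attitude:
    """A5-style policy: on any exception, emit a diagnostic value and stop.
    Downstream code that treats this as flight data will steer wildly."""
    try:
        return compute_attitude(raw)
    except OverflowError:
        return DIAGNOSTIC_PATTERN

def degraded_policy(raw: float, last_good: Attitude) -> Attitude:
    """Alternative the board suggested: fall back to a best-effort estimate
    (here, simply the last known-good reading) instead of failing hard."""
    try:
        return compute_attitude(raw)
    except OverflowError:
        return last_good
```

The shutdown policy is sensible for random hardware faults, where the backup unit is unlikely to share the fault; it is the wrong policy for a deterministic software fault, which takes down the backup in exactly the same way.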

The inquiry board made a number of recommendations (numbered R1 through R14), which can be generalized into lessons learned from this case study that are useful to IT professionals.

  • Don’t run code or systems that you don’t need (R1). The SRI alignment function should have been switched off after lift-off. This DevOps mistake is avoidable, easy to correct, and happens too often (think Knight Capital).
  • Quality assurance matters (R2, R10, R11). If you don’t test the system end-to-end with high coverage of realistic positive and negative scenarios, then your product and project expectations are dreams, not empirically grounded hypotheses.
  • The smallest code-quality details matter, whether it’s Ada, C, JavaScript, SQL, or Python (R4). An arithmetic conversion error destroyed a multi-million-dollar spacecraft and set back a multi-billion-dollar program by several years. In our own projects and systems, we must see both the forest and the trees.
  • System complexity matters. Carefully consider which components are critical, understand their fault surfaces, and avoid single points of failure (R6, R8, R13). Reusing someone else’s software artifacts just because they worked for them is not sufficient for your success. Examine the software’s assertions (think design by contract), inspect its tests, question its assumptions, and think through its dependencies. As an aside, one good example of quality transparency is APC: their power units ship with the unit-test output from the factory.
  • Culture matters, and it can eat strategy for breakfast along with your technology project (R14). Besides a design bias toward mitigating failure through shutdown and backup failover, there were QA shortcuts, aggressive borrowing of A4 code, and no single point of accountability on the A5 team. ESA held no one responsible for the failure, a classic case of diffuse responsibility.
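On design by contract: a reused component's hidden input assumptions can be stated explicitly at the call boundary, so reuse under new operating conditions fails fast instead of silently. This is a hypothetical Python sketch; the envelope constant and function names are illustrative, not from any Ariane specification:

```python
A4_INPUT_ENVELOPE = 32767.0  # illustrative bound, not a real Ariane spec value

def reused_alignment(velocity: float) -> int:
    """Hypothetical stand-in for an inherited routine whose correctness was
    only ever demonstrated for inputs in the original operating range."""
    return int(velocity)

def alignment_with_contract(velocity: float) -> int:
    """Wrap reused code in an explicit precondition so its assumptions
    fail loudly, and traceably, when operating conditions change."""
    assert 0.0 <= velocity <= A4_INPUT_ENVELOPE, (
        f"input {velocity} is outside the envelope the reused code was "
        "validated for"
    )
    return reused_alignment(velocity)

print(alignment_with_contract(30000.0))  # within contract: 30000
```

A contract violation during A5 simulation runs would have surfaced the trajectory mismatch on the ground rather than 37 seconds into the maiden flight.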

The failure of Flight 501 highlighted to the general public, politicians, and business executives the risks of complex, costly computing systems. It increased support for research on ensuring the reliability of safety-critical systems, and automated analysis of the Ariane code, written in Ada, became one of the first examples of large-scale static code analysis.

Afterwards, four replacement Cluster satellites were built and launched in pairs aboard Soyuz-U/Fregat rockets in 2000. The Ariane 5 program resumed, achieved dozens of successful launches and hundreds of satellite deployments, and remains active. Its successor, the Ariane 6, is under development, with plans to enter service in the 2020s.
