No, a “checklist error” did not almost derail the first moon landing

From the archives: The cause of Apollo 11’s landing alarms is a lot more complicated.

2019-07-05 10:07

Last week was the forty-sixth anniversary of the Apollo 11 moon landing—the first of the six crewed landings on our nearest celestial neighbor. In the years between 1969 and 1972, 12 human beings walked on the surface of the moon: Neil Armstrong, Buzz Aldrin, Pete Conrad, Al Bean, Alan Shepard, Ed Mitchell, Dave Scott, Jim Irwin, John Young, Charlie Duke, Jack Schmitt, and Gene Cernan. Each Apollo landing by necessity leapfrogged the previous by some notable amount, because even as Apollo 11 was preparing to lift off it was obvious that the money wasn’t coming and Project Apollo might be the only chance to visit the moon—perhaps for a long, long time. Even though Apollo 10’s "dress rehearsal" had taken NASA through all but the final phase of the lunar landing two months before, there were still a large number of unknowns in play when Neil Armstrong and Buzz Aldrin separated Eagle from Columbia, leaving Michael Collins to watch his crewmates descend to the lunar surface—perhaps to stay there forever. And as it turned out, the first landing on the moon almost did encounter disaster. Shortly after Eagle entered one of the most complicated stages of the descent, the guidance computer began throwing off alarms—very serious alarms, of a type no one in mission control or on the spacecraft was immediately familiar with. Back at MOCR2 in Houston, the burden to determine whether or not the alarms were benign—and therefore the decision to determine whether to abort the landing, blow the Eagle in half, and make an emergency burn to try to make it back up to Columbia—fell on the shoulders of two people: guidance controller Steve Bales and backroom guidance specialist Jack Garman.

No common pocket calculator

It’s an accepted axiom that the Apollo missions flew to the moon on a computer variously described as "less powerful than a pocket calculator" or "less powerful than a digital watch" or other similarly deprecatory statements, but that’s a half-truth. While shockingly primitive at first glance to modern eyes, the Apollo Guidance Computer was a capstone of engineering achievement in the context of the 1960s; further, the software that ran on it was almost miraculously sophisticated by the standards of the day. The executive system—written in part by software geniuses like Hal Laning at MIT—pioneered many of the ideas behind real-time computing, and many of the principles first put into effect in the various revisions of the Apollo Guidance Computer’s software are still used in real-time systems today. The Apollo Guidance Computer really isn’t a general purpose computer at all. It had no need to address a complicated set of peripherals through some kind of hardware abstraction layer; it had no need to parse English-style commands; it had no high-level programming language to interpret. As explained by Woods in How Apollo Flew to the Moon, the Apollo Guidance Computer can be best understood as a sophisticated embedded controller, built and wedded to the hardware of its host vehicle. The AGC in the Command Module was built to control the Apollo spacecraft as it flew toward the moon, keeping track of where it was and where it was going (in the form of its state vector relative to one of several different points of reference, which changed as the mission progressed). The AGC in the Lunar Module, on the other hand, had an entirely different set of tasks focused on getting the lander out of orbit and onto the lunar surface and then back up again. Though the two computers were similar from a hardware perspective—each using blocks of integrated circuits and rope core memory to store and operate on 16-bit words—each ran very different sets of hardcoded software. 
However, one thing in common between the two computers’ software was their basic bifurcated structure. Split into two "halves," each guidance computer ran a main program called the "Executive," which O’Brien calls "a priority-scheduled multiprogramming operating system," and another called the "Interpreter," which created sort of a "virtual computer" that allowed complex program actions (including work with complex data types, vector math, transcendental functions, and many other things) to be accomplished using the limited hardware instructions available. It’s the Executive that we’re going to look at very briefly—because the design of the Executive almost certainly saved Apollo 11.
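
To make "priority-scheduled" a little more concrete, here is a minimal Python sketch of a job table in which the highest-priority waiting job always runs next. It is a toy illustration only: the class, the job names, the priority numbers, and the slot count are all hypothetical and are not taken from the AGC's actual software.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class Job:
    sort_key: int                     # negative priority, so heapq pops the highest priority first
    name: str = field(compare=False)


class Executive:
    """Toy priority scheduler: a fixed number of job slots, highest priority runs first."""

    def __init__(self, slots: int):
        self.slots = slots            # hypothetical table size
        self.queue = []

    def schedule(self, name: str, priority: int) -> bool:
        """Queue a job; return False if no slot is free (the overflow case)."""
        if len(self.queue) >= self.slots:
            return False
        heapq.heappush(self.queue, Job(-priority, name))
        return True

    def run_next(self) -> Optional[str]:
        """Remove and return the highest-priority waiting job, or None if idle."""
        return heapq.heappop(self.queue).name if self.queue else None


# Hypothetical jobs, queued out of order.
ex = Executive(slots=3)
ex.schedule("display update", priority=5)
ex.schedule("guidance equations", priority=20)
ex.schedule("radar read", priority=10)
order = [ex.run_next() for _ in range(3)]
print(order)
```

The jobs come back in priority order regardless of the order in which they were queued, and `schedule` reports failure once every slot is taken. That second behavior is the one that matters for the rest of this story.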

Spinning plates

The descent of an Apollo Lunar Module like Eagle from an Apollo Command and Service Module like Columbia was a complex operation, which NASA chopped up into a number of phases. To go from orbit to landing, the Lunar Module first pointed its ventrally mounted descent engine forward into its direction of travel and began to thrust. In the vagaries of orbital mechanics, this "retrograde" thrust lowered the altitude of the LM’s orbit at the point directly opposite its current location. Thrust long enough, and eventually your altitude would drop so much that your orbit would stop being an orbit. For the LM, the retrograde thrust was designed to scrub velocity and altitude and continued for a predetermined amount of time.

During this burn, the LM’s copy of the Apollo Guidance Computer was running Program 63 as its "major mode." That is, while the Executive could do a number of things simultaneously, Program 63 was the thing it was focusing on as its foreground task. P63 was, not surprisingly, the braking phase guidance program, designed to keep the LM oriented for its retrograde burn and to keep that burn within the desired parameters so that the LM decelerated on profile. At a certain point after a large amount of velocity had been scrubbed, the computer shifted from Program 63 to Program 64 for approach phase guidance.

During P63, the crew couldn’t see the moon: they had their backs to the lunar surface and were looking "up" at space. (For Apollo 11, the crew had started their powered descent facing downward, so they could see the surface, then yawed the LM 180 degrees in order to be properly positioned for P64.) When P64 was triggered, the LM underwent a "pitchover" maneuver, wherein the spacecraft changed its orientation so that it was mostly upright and in position for landing. After the maneuver was complete, the LM’s descent engine was mostly pointed at the lunar surface and the crew was mostly facing forward, and the remainder of the descent was done with the LM balancing on its engine (this is why the Apollo landings were called "powered descents": the LM rode its own engine down).

The phase of flight that got kicked off with P64 was the busiest of the entire landing, for both the crew and the computer. From the crew’s perspective, it was when the commander (Armstrong) had to begin actively eyeballing the upcoming landing site through his window and assisting the computer with guidance; the Lunar Module Pilot (Aldrin) stayed eyes-down, looking at a number of instruments and telling the commander a whole mess of readings about the LM’s descent rate, altitude, horizontal velocity, fuel state, and a few other things—things the commander couldn’t remove his eyes from the window to look at. The computer, meanwhile, was busy with its own tasks. The LM’s Executive maintained a prioritized list of tasks it had to get to, and the computer cycled through listening for all of its interrupts once every 960 milliseconds. During P64 the computer had a lot to do—during this phase of flight, about 85 percent of the computer’s "processing power" was tied up. It was a bad time for something to go wrong—so, of course, that's when things started to go wrong.

Program alarm

A few minutes prior to the P64 pitchover, Armstrong relayed to Houston that, based on what he and Aldrin were seeing from their instruments, they were going to overshoot their intended landing point. While controllers chewed on this and Armstrong debated taking manual control of the landing, the console lit up with an alarm. "Program alarm," called out Armstrong over the air-to-ground loop, although controllers on the ground had far more visibility into the LM’s systems than did he and Aldrin and could also see the alarm. And four seconds later, after consulting the guidance computer’s five-line electroluminescent display, he clarified, "It’s a 1202."

A "program alarm" during landing was something that had been simulated, multiple times, in the run-up to the landing. The first time controllers had faced a similar alarm, they’d called for an immediate abort—an action that would have terminated the landing if it had been real and called for an immediate return to Earth. After the simulated abort, though, flight directors wondered if the controllers hadn’t been too quick on the abort button. Was a hair-trigger abort necessary to keep the crew alive, or might there be time in the landing to troubleshoot the alarm? And, to be clear, an abort during landing was not a minor thing. The procedure would have involved Armstrong pressing the "ABORT STAGE" button on the LM’s panel, which would have fired explosive bolts and guillotines and separated the LM’s ascent stage from its descent stage. Then, the ascent engine would fire, doing its best to add velocity back to the descending ship, attempting to push it back into some kind of stable orbit so that the crew could find and rendezvous with the Command Module. It was something the crews trained to do—but it wouldn’t have been easy. And it would have carried with it the stigma of an aborted mission. In fact, in an effort to ensure that the LM crew could immediately get to the business of finding the Command Module in the event of an aborted landing, Buzz Aldrin had made sure to activate the LM’s rendezvous radar—and that, inadvertently, turned out to be the source of the alarms.

Murphy’s Law

"Give us a reading on the 1202 program alarm," prompted Armstrong after about 18 seconds of silence from the ground. "Roger," came the voice of CAPCOM Charlie Duke, "We got—we’re go on that alarm." Those 18 seconds had been filled with frantic thinking on the ground by Steve Bales and Jack Garman. Flight Director Gene Kranz had looked to Bales for an immediate answer—was the crew in danger? What did the alarm mean?—and Bales had looked to Garman, his guidance software expert. After the simulated abort, Garman had been instructed by Kranz to write down and memorize every single possible LM guidance computer error message so that in the event of an actual error during flight, someone would be able to make an immediate and informed call based on the error type. Today, we might have this kind of thing stored on an intranet page or bookmarked on our smartphones, but with the primitive mainframe-based computers that powered Mission Control from the Real Time Computing Complex, it was a lot easier to simply have it written on paper. In the space of a few seconds, Bales had passed the alarm to Garman, who quickly responded that the alarm was noncritical and that the landing could go on, as long as the alarm didn’t become continuous. And, just four minutes later, another alarm of the same type occurred; as the crew radioed it down to ground control, Garman called out almost before the readback was completed that the alarm was of the same type—the landing could continue, and it did in spite of one more program alarm after that. The alarms ceased a minute later when the crew shifted the computer out of Program 64 and into P66 for the terminal landing phase. Apollo 11 landed safely.

A crazy confluence of events

So what, exactly, happened? What are 1201 and 1202 program alarms, and what caused them? And why are the alarms so often blamed on a "checklist error?" Last things first: the "checklist error" accusation comes up often because the circumstances that caused the alarms were brought about by Aldrin turning on the LM’s rendezvous radar and setting its mode switch to its SLEW position, which energized the radar’s Coupling Data Units and allowed the computer to read data from, and control, the steerable rendezvous radar antenna. Activating the rendezvous radar during landing—which, again, would lessen the workload on the crew in the event of an abort, because the system they’d need to use to find Columbia would already be operational—was not at all a "checklist error." This was established procedure. The cause of the program alarms is best explained (in extreme, penetrating detail) by MIT veteran Don Eyles, who worked on the Apollo Guidance Computer’s software. The problem wasn’t a checklist error—it was more properly a design documentation error. The LM’s rendezvous radar contained a collection of electronics called the Attitude, Translation, and Control Assembly, or ATCA. The ATCA was responsible for providing an electrical interface whereby the LM’s guidance computer could control the radar’s hardware, and the ATCA was powered by 800Hz, 28-volt alternating current. The guidance computer in turn used a piece of equipment called a Coupling Data Unit, or CDU, to read the orientation of the radar’s antenna (its shaft and trunnion angles) so that the guidance computer could keep track of where the radar was pointed. The CDUs—there were actually two of them—were also powered by a separate 800Hz, 28-volt AC reference signal. Between the ATCA and the CDUs, the guidance computer could both control and understand the position of the radar.

Here’s where the problem arose. In order for the CDUs to actually make sense of the data they were supposed to be tracking (the shaft and trunnion angles of the radar’s antenna), the two separate 800Hz 28VAC feeds to the ATCA and the CDUs needed to be both frequency-locked and phase-synchronized. However, according to Eyles, the interface control document that defined the parameters of the systems didn’t actually call for phase synchronization, just frequency locking. So, no provision was made to ensure the CDUs’ power and the ATCA’s power would be locked to the same phase. Depending on the exact millisecond when the rendezvous radar was powered on and the mode switch changed to "SLEW," the two different power supplies could be set any which way: they’d be at the exact same frequency, but could be at any relative phase.

It just so happened that on Apollo 11, the stars aligned and Aldrin activated the radar at just the right millisecond in time such that when power hit the CDUs, that power was out of phase from the ATCA’s power. The electrical resolvers that reported the shaft and trunnion angles of the radar to the CDUs were being excited by out-of-phase power from the CDUs' perspective, and the CDUs interpreted those readings as being far outside their normal expected range.

This effectively caused the CDUs to freak out. Faced with out-of-bounds readings for the radar’s hardware, each CDU began to issue radar increment and decrement interrupts to the guidance computer, and lots of them: 12,800 interrupts per second between the two of them, in fact. The interrupts would normally be processed by the guidance computer and then used to tell the ATCA to align the dish, but the parameters were out of normal bounds and there was no way to move the dish to where they were directing.

The guidance computer already had its hands full with all of the things it had to do during powered descent, and was at about 85 percent capacity; the extra interrupts required approximately another 15 percent to deal with. This made the Executive too busy to get to everything on its priority list before its 960ms list cycle had completed; this, in turn, caused two of the guidance computer’s storage areas to overflow.
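
The 15 percent figure can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes that each spurious pulse cost the computer one memory cycle of about 11.7 microseconds; that per-pulse cost is not stated in this article and is an assumption brought in purely for illustration. The 12,800 pulses per second and the 85 percent baseline load are the figures given above.

```python
PULSES_PER_SECOND = 12_800       # from the article: both CDUs combined
SECONDS_PER_PULSE = 11.7e-6      # ASSUMPTION: one memory cycle per pulse, not from the article
BASELINE_LOAD = 0.85             # from the article: approximate load during powered descent


def stolen_fraction(pulses_per_second: float, seconds_per_pulse: float) -> float:
    """Fraction of every second the computer spends servicing the pulses."""
    return pulses_per_second * seconds_per_pulse


extra = stolen_fraction(PULSES_PER_SECOND, SECONDS_PER_PULSE)
total = BASELINE_LOAD + extra

print(f"extra load from radar pulses: {extra:.1%}")
print(f"total load during descent:    {total:.1%}")
```

Under that assumption the extra load works out to roughly 15 percent, which puts the total at essentially 100 percent. Both inputs are approximate, so the exact total matters less than the conclusion: there was no slack left, and any job that ran slightly long would still be occupying storage when its replacement was scheduled.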

The first program alarm was 1202, which translated to "Executive overflow—no VAC areas." "VAC" in this context means "vector accumulators," which are unswitched memory areas that store temporary variables for interpretive instructions. The guidance computer’s program flow was carefully designed so that it was ordinarily impossible to run out of this kind of temporary storage—and yet, because the computer was busier than should have been possible, it happened, and the alarm sounded. When the 1202 alarm went off, the Executive initiated the first and least-disruptive of its three restart modes to try to fix the problem: a "BAILOUT" restart. The "BAILOUT" restart routine killed all of the computer's running jobs, then had the Executive call up its prioritized table of what it was supposed to be doing and automatically picked back up its tasks, one by one. This restart included a flush of all the temporary storage areas (like the VACs), and at the same time the computer made sure to continue tracking the LM’s state vector—that is, its position and velocity relative to its current reference point. This proved enough to deal with the problem. The second alarm that appeared, 1201, was of the same type—this time, it was "Executive overflow—no core sets." Somewhat similar to VACs but simpler, the core set area of the guidance computer was used to store all the necessary information about each running program in the guidance computer. As with the VACs, careful program flow planning ensured that there should always be enough free core sets for the running programs to do their jobs—except when suddenly there wasn’t. And, as with the 1202 alarm, the computer quickly dropped and flushed everything it was doing and then resumed its tasks, in order of priority—all without forgetting where it was going and what it was supposed to be doing.
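
The overflow-and-recover behavior described above can be sketched as a toy model. This is not the AGC's code: the slot counts, the job names, and the `restart_table` structure are hypothetical. Only the two alarm codes and the general idea (flush everything, keep the state vector, resume the essential jobs in priority order) come from the text.

```python
class Overflow(Exception):
    """Raised when a new job cannot get the storage it needs."""

    def __init__(self, alarm: int):
        super().__init__(f"program alarm {alarm}")
        self.alarm = alarm


class ToyExecutive:
    def __init__(self, core_sets, vac_areas, restart_table):
        self.core_sets = core_sets          # hypothetical number of job slots
        self.vac_areas = vac_areas          # hypothetical number of VAC areas
        self.restart_table = restart_table  # essential jobs: (priority, name, needs_vac)
        self.state_vector = None            # survives a BAILOUT untouched
        self.alarms = []
        self.jobs = []
        self._resume_essentials()

    def _resume_essentials(self):
        """Start the essential jobs again, highest priority first."""
        for priority, name, needs_vac in sorted(self.restart_table, reverse=True):
            self._start(priority, name, needs_vac)

    def _start(self, priority, name, needs_vac):
        if len(self.jobs) >= self.core_sets:
            raise Overflow(1201)            # no core sets
        vacs_in_use = sum(1 for _, _, uses_vac in self.jobs if uses_vac)
        if needs_vac and vacs_in_use >= self.vac_areas:
            raise Overflow(1202)            # no VAC areas
        self.jobs.append((priority, name, needs_vac))

    def bailout(self, alarm):
        """Record the alarm, flush every job, then resume only the essentials."""
        self.alarms.append(alarm)
        self.jobs.clear()
        self._resume_essentials()

    def request(self, priority, name, needs_vac=False):
        """Try to start a job; on overflow, trigger a BAILOUT and drop the request."""
        try:
            self._start(priority, name, needs_vac)
        except Overflow as err:
            self.bailout(err.alarm)


ex = ToyExecutive(
    core_sets=4,
    vac_areas=2,
    restart_table=[(20, "guidance", True), (10, "display", False)],
)
ex.state_vector = {"position": (1.0, 2.0, 3.0), "velocity": (0.1, 0.2, 0.3)}

# Simulate unfinished work piling up because the computer is too busy to retire it.
for i in range(5):
    ex.request(15, f"stale pass {i}", needs_vac=True)

print(ex.alarms)
print([name for _, name, _ in ex.jobs])
print(ex.state_vector)
```

In this run the stale jobs exhaust the VAC areas twice, so two 1202 alarms are recorded. After each one the job list is back to the essentials, and the state vector is never touched. In this toy the request that triggered the overflow is simply dropped, which is a simplification chosen for brevity.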

More serious errors?

If the problems had been more severe, or if the alarms had been continuous, it’s possible the guidance computer would have stepped things up and triggered its second kind of restart: a "P00DOO" restart. Named "P00DOO" partially because the routine forced the computer's major mode to Program 00 (the idle program) and partially because if it ever occurred during a mission, controllers and astronauts alike would probably fill their pants, a P00DOO restart caused the computer to stop what it was doing just like a BAILOUT restart. However, the similarities ended there. A BAILOUT restart kept the major mode the same, so a restart during P63 or P64 kept the computer running P63 or P64 and kept the LM landing. A P00DOO restart during powered descent would have reset the computer to P00, idle, and left all of the VACs and core sets alone for troubleshooting.

During a landing, this would have been extremely problematic, even dangerous. The computer wouldn’t have lost its state vector and the descent engine wouldn't have stopped firing, but a number of the instruments the crew was using (like the landing radar altimeter) would no longer have valid readings. The crew would almost certainly have needed to trigger an immediate abort, with all the spacecraft-halving explosive fun that would have entailed, and then switch away to the much more limited Abort Guidance System (an extremely cut-down baby version of the guidance computer) to guide them back to the Command Module.

The third, most serious reset would have been extremely bad as well. As a last resort, the guidance computer could call the "FRESHSTRT" routine, which would completely reset everything, losing its guidance reference information and essentially "rebooting" in a fresh state. With no guidance information at all, continuing to the lunar surface wouldn’t have been an option: a switch to the backup Abort Guidance System would have been mandated and an abort would have been required.

Checklist schmecklist

Ultimately, what saved Apollo 11 comes back to the same thing that saved every other Apollo flight where some kind of near-critical incident arose (and, for the record, that was basically every single Apollo flight): exhaustive training and readiness on the part of every person involved, coupled with smartly designed systems. The brilliance of the programming in play on the Apollo Guidance Computer can't be overstated: it's one thing to design a real-time guidance system today, in this era of GPS satellites and smartphones and decades of software design best practices. It was entirely another thing to do it in the 1960s, with no off-the-shelf parts available and no real established software design guidelines to follow—and to do it so well that not only does it handle unexpected failures, but it does so while landing on the Moon. There’s a tremendous amount of information out there on the Apollo Guidance Computer. It’s an amazing bit of hardware, and even now it still has stories to tell. If you’d like to fully immerse yourself into the world of Apollo and its guidance computer, there are two books that you should immediately acquire: How Apollo Flew To the Moon, by David Woods, and The Apollo Guidance Computer: Architecture and Operation, by Frank O’Brien. The former gives a wonderful overview of the entire trip from the Earth to the Moon, while the latter provides all the information you could ever want on the hows and whys of the computer itself. There’s also a plethora of documentaries on the subject—Discovery’s Moon Machines series has a good episode on the Apollo Guidance Computer, for example. In any case, the next time you see a space enthusiast saying that Apollo 11 almost aborted because of "a checklist error," you’ve got the ammunition to set 'em straight. 
It wasn’t a checklist error: it was an absurd confluence of events that started with a documentation error and ended up with a switch being flicked at precisely the right (or wrong) fraction of a second. And it was the split-second decisions of a bunch of young folks in Houston and MIT that saved the mission from an almost certain abort. For his quick thinking, folks at NASA began referring to Jack Garman as "Gar-Flash," a good name for a steely-eyed missile man.

Editor's Note (7/28/15): I've modified a few specifics in this piece thanks to input from Apollo Lunar Surface Journal contributor Paul Fjeld. Thanks, Paul!