An Investigation of the Therac-25 Accidents -- Part II

Nancy Leveson, University of Washington
Clark S. Turner, University of California, Irvine

Reprinted with permission, IEEE Computer, Vol. 26, No. 7, July 1993, pp. 18-41.

Yakima Valley Memorial Hospital, 1985. As with the Kennestone overdose, machine malfunction in this accident in Yakima, Washington, was not acknowledged until after later accidents were understood.

The Therac-25 at Yakima had been modified in September 1985 in response to the overdose at Hamilton. During December 1985, a woman came in for treatment with the Therac-25. She developed erythema (excessive reddening of the skin) in a parallel striped pattern at one port site (her right hip) after one of the treatments. Despite this, she continued to be treated by the Therac-25 because her reaction was not determined to be abnormal until January or February of 1986. On January 6, 1986, her treatments were completed.

The staff monitored the skin reaction closely and attempted to find possible causes. The open slots in the blocking trays in the Therac-25 could have produced such a striped pattern, but by the time the skin reaction had been determined to be abnormal, the blocking trays had been discarded. The blocking arrangement and tray striping orientation could not be reproduced. A reaction to chemotherapy was ruled out because that should have produced reactions at the other ports and would not have produced stripes. When it was discovered that the woman slept with a heating pad, a possible explanation was offered on the basis of the parallel wires that deliver the heat in such pads. The staff x-rayed the heating pad and discovered that the wire pattern did not correspond to the erythema pattern on the patient's hip.

The hospital staff sent a letter to AECL on January 31, and they also spoke on the phone with the AECL technical support supervisor. On February 24, 1986, the AECL technical support supervisor sent a written response to the director of radiation therapy at Yakima saying, "After careful consideration, we are of the opinion that this damage could not have been produced by any malfunction of the Therac-25 or by any operator error." The letter goes on to support this opinion by listing two pages of technical reasons why an overdose by the Therac-25 was impossible, along with the additional argument that there have "apparently been no other instances of similar damage to this or other patients." The letter ends, "In closing, I wish to advise that this matter has been brought to the attention of our Hazards Committee, as is normal practice."

The hospital staff eventually ascribed the skin/tissue problem to "cause unknown." In a report written on this first Yakima incident after another Yakima overdose a year later (described in a later section), the medical physicist involved wrote:

At that time, we did not believe that [the patient] was overdosed because the manufacturer had installed additional hardware and software safety devices to the accelerator.

In a letter from the manufacturer dated 16-Sep-85, it is stated that "Analysis of the hazard rate resulting from these modifications indicates an improvement of at least five orders of magnitude"! With such an improvement in safety (10,000,000 percent) we did not believe that there could have been any accelerator malfunction. These modifications to the accelerator were completed on 5,6-Sep-85.

Even with fairly sophisticated physics support, the hospital staff, as users, did not have the ability to investigate the possibility of machine malfunction further. They were not aware of any other incidents, and, in fact, were told that there had been none, so there was no reason for them to pursue the matter. However, it seems that three similar incidents with this equipment should have triggered some suspicion and investigation by the manufacturer and the appropriate government agencies. This assumes, of course, that these incidents were all reported and known by AECL and by the government regulators. If they were not, then it is appropriate to ask why they were not and how this could be remedied in the future.

About a year later (in February 1987), after the second Yakima overdose led the hospital staff to suspect that the first injury had been due to a Therac-25 fault, the staff investigated and found that this patient had a chronic skin ulcer, tissue necrosis (death) under the skin, and was in constant pain. This was surgically repaired, skin grafts were made, and the symptoms relieved. The patient is alive today, with minor disability and some scarring related to the overdose. The hospital staff concluded that the dose accidentally delivered to this patient must have been much lower than in the second accident, as the reaction was significantly less intense and necrosis did not develop until six to eight months after exposure. Some other factors related to the place on the body where the overdose occurred also kept her from having more significant problems as a result of the exposure.

East Texas Cancer Center, March 1986. More is known about the Tyler, Texas, accidents than the others because of the diligence of the Tyler hospital physicist, Fritz Hager, without whose efforts the understanding of the software problems might have been delayed even further.

The Therac-25 was at the East Texas Cancer Center (ETCC) for two years before the first serious accident occurred; during that time, more than 500 patients had been treated. On March 21, 1986, a male patient came into ETCC for his ninth treatment on the Therac-25, one of a series prescribed as follow-up to the removal of a tumor from his back.

The patient's treatment was to be a 22-MeV electron-beam treatment of 180 rads over a 10 x 17-cm field on the upper back and a little to the left of his spine, or a total of 6,000 rads over a period of 6 1/2 weeks. He was taken into the treatment room and placed face down on the treatment table. The operator then left the treatment room, closed the door, and sat at the control terminal.

The operator had held this job for some time, and her typing efficiency had increased with experience. She could quickly enter prescription data and change it conveniently with the Therac's editing features. She entered the patient's prescription data quickly, then noticed that for mode she had typed "x" (for X ray) when she had intended "e" (for electron). This was a common mistake since most treatments involved X rays, and she had become accustomed to typing this. The mistake was easy to fix; she merely used the cursor up key to edit the mode entry.

Since the other parameters she had entered were correct, she hit the return key several times and left their values unchanged. She reached the bottom of the screen where a message indicated that the parameters had been "verified" and the terminal displayed "beam ready," as expected. She hit the one-key command "B" (for "beam on") to begin the treatment. After a moment, the machine shut down and the console displayed the message "Malfunction 54." The machine also displayed a "treatment pause," indicating a problem of low priority (see the operator interface sidebar). The sheet on the side of the machine explained that this malfunction was a "dose input 2" error. The ETCC did not have any other information available in its instruction manual or other Therac-25 documentation to explain the meaning of Malfunction 54. An AECL technician later testified that "dose input 2" meant that a dose had been delivered that was either too high or too low.

The machine showed a substantial underdose on its dose monitor display: 6 monitor units delivered, whereas the operator had requested 202 monitor units. The operator was accustomed to the quirks of the machine, which would frequently stop or delay treatment. In the past, the only consequences had been inconvenience. She immediately took the normal action when the machine merely paused, which was to hit the "P" key to proceed with the treatment. The machine promptly shut down with the same "Malfunction 54" error and the same underdose shown by the display terminal.

The operator was isolated from the patient, since the machine apparatus was inside a shielded room of its own. The only way the operator could be alerted to patient difficulty was through audio and video monitors. On this day, the video display was unplugged and the audio monitor was broken.

After the first attempt to treat him, the patient said that he felt like he had received an electric shock or that someone had poured hot coffee on his back: He felt a thump and heat and heard a buzzing sound from the equipment. Since this was his ninth treatment, he knew that this was not normal. He began to get up from the treatment table to go for help. It was at this moment that the operator hit the "P" key to proceed with the treatment. The patient said that he felt like his arm was being shocked by electricity and that his hand was leaving his body. He went to the treatment room door and pounded on it. The operator was shocked and immediately opened the door for him. He appeared shaken and upset.

The patient was immediately examined by a physician, who observed intense erythema over the treatment area, but suspected nothing more serious than electric shock. The patient was discharged with instructions to return if he suffered any further reactions. The hospital physicist was called in, and he found the machine calibration within specifications. The meaning of the malfunction message was not understood. The machine was then used to treat patients for the rest of the day.

In actuality, but unknown to anyone at that time, the patient had received a massive overdose, concentrated in the center of the treatment area. After-the-fact simulations of the accident revealed possible doses of 16,500 to 25,000 rads in less than 1 second over an area of about 1 cm.

During the weeks following the accident, the patient continued to have pain in his neck and shoulder. He lost the function of his left arm and had periodic bouts of nausea and vomiting. He was eventually hospitalized for radiation-induced myelitis of the cervical cord causing paralysis of his left arm and both legs, left vocal cord paralysis (which left him unable to speak), neurogenic bowel and bladder, and paralysis of the left diaphragm. He also had a lesion on his left lung and recurrent herpes simplex skin infections. He died from complications of the overdose five months after the accident.

User and manufacturer response. The Therac-25 was shut down for testing the day after this accident. One local AECL engineer and one from the home office in Canada came to ETCC to investigate. They spent a day running the machine through tests but could not reproduce a Malfunction 54. The AECL home office engineer reportedly explained that it was not possible for the Therac-25 to overdose a patient. The ETCC physicist claims that he asked AECL at this time if there were any other reports of radiation overexposure and that the AECL personnel (including the quality assurance manager) told him that AECL knew of no accidents involving radiation overexposure by the Therac-25. This seems odd since AECL was surely at least aware of the Hamilton accident that had occurred seven months before and the Yakima accident, and, even by its own account, AECL learned of the Georgia lawsuit about this time (the suit had been filed four months earlier). The AECL engineers then suggested that an electrical problem might have caused this accident.

The electric shock theory was checked out thoroughly by an independent engineering firm. The final report indicated that there was no electrical grounding problem in the machine, and it did not appear capable of giving a patient an electrical shock. The ETCC physicist checked the calibration of the Therac-25 and found it to be satisfactory. The center put the machine back into service on April 7, 1986, convinced that it was performing properly.

East Texas Cancer Center, April 1986. Three weeks after the first ETCC accident, on Friday, April 11, 1986, another male patient was scheduled to receive an electron treatment at ETCC for a skin cancer on the side of his face. The prescription was for 10 MeV to an area of approximately 7 x 10 cm. The same technician who had treated the first Tyler accident victim prepared this patient for treatment. Much of what follows is from the deposition of the Tyler Therac-25 operator.

As with her former patient, she entered the prescription data and then noticed an error in the mode. Again she used the cursor up key to change the mode from X ray to electron. After she finished editing, she pressed the return key several times to place the cursor on the bottom of the screen. She saw the "beam ready" message displayed and turned the beam on.

Within a few seconds the machine shut down, making a loud noise audible via the (now working) intercom. The display showed Malfunction 54 again. The operator rushed into the treatment room, hearing her patient moaning for help. The patient began to remove the tape that had held his head in position and said something was wrong. She asked him what he felt, and he replied "fire" on the side of his face. She immediately went to the hospital physicist and told him that another patient appeared to have been burned. Asked by the physicist to describe what he had experienced, the patient explained that something had hit him on the side of the face, he saw a flash of light, and he heard a sizzling sound reminiscent of frying eggs. He was very agitated and asked, "What happened to me, what happened to me?"

This patient died from the overdose on May 1, 1986, three weeks after the accident. He had disorientation that progressed to coma, fever to 104 degrees Fahrenheit, and neurological damage. Autopsy showed an acute high-dose radiation injury to the right temporal lobe of the brain and the brain stem.

User and manufacturer response. After this second Tyler accident, the ETCC physicist immediately took the machine out of service and called AECL to alert the company to this second apparent overexposure. The Tyler physicist then began his own careful investigation. He worked with the operator, who remembered exactly what she had done on this occasion. After a great deal of effort, they were eventually able to elicit the Malfunction 54 message. They determined that data-entry speed during editing was the key factor in producing the error condition: If the prescription data was edited at a fast pace (as is natural for someone who has repeated the procedure a large number of times), the overdose occurred.

It took some practice before the physicist could repeat the procedure rapidly enough to elicit the Malfunction 54 message at will. Once he could do this, he set about measuring the actual dose delivered under the error condition. He took a measurement of about 804 rads but realized that the ion chamber had become saturated. After making adjustments to extend his measurement ability, he determined that the dose was somewhere over 4,000 rads.

The next day, an engineer from AECL called and said that he could not reproduce the error. After the ETCC physicist explained that the procedure had to be performed quite rapidly, AECL could finally produce a similar malfunction on its own machine. AECL then set up its own set of measurements to test the dosage delivered. Two days after the accident, AECL said they had measured the dosage (at the center of the field) to be 25,000 rads. An AECL engineer explained that the frying sound heard by the patient was the ion chambers being saturated.

In fact, it is not possible to determine the exact dose each of the accident victims received; the total dose delivered during the malfunction conditions was found to vary enormously when different clinics simulated the faults. The number of pulses delivered in the 0.3 second that elapsed before interlock shutoff varied because the software adjusted the start-up pulse-repetition frequency to very different values on different machines. Therefore, there is still some uncertainty as to the doses actually received in the accidents.[1]

In one lawsuit that resulted from the Tyler accidents, the AECL quality control manager testified that a "cursor up" problem had been found in the service mode at the Kennestone clinic and one other clinic in February or March 1985 and also in the summer of 1985. Both times, AECL thought that the software problems had been fixed. There is no way to determine whether there is any relationship between these problems and the Tyler accidents.

Related Therac-20 problems. After the Tyler accidents, Therac-20 users (who had heard informally about the Tyler accidents from Therac-25 users) conducted informal investigations to determine whether the same problem could occur with their machines. As noted earlier, the software for the Therac-25 and Therac-20 both "evolved" from the Therac-6 software. Additional functions had to be added because the Therac-20 (and Therac-25) operates in both X-ray and electron mode, while the Therac-6 has only X-ray mode. The CGR employees modified the software for the Therac-20 to handle the dual modes.

When the Therac-25 development began, AECL engineers adapted the software from the Therac-6, but they also borrowed software routines from the Therac-20 to handle electron mode. The agreements between AECL and CGR gave both companies the right to tap technology used in joint products for their other products.

After the second Tyler accident, a physicist at the University of Chicago Joint Center for Radiation Therapy heard about the Therac-25 software problem and decided to find out whether the same thing could happen with the Therac-20. At first, the physicist was unable to reproduce the error on his machine, but two months later he found the link.

The Therac-20 at the University of Chicago is used to teach students in a radiation therapy school conducted by the center. The center's physicist, Frank Borger, noticed that whenever a new class of students started using the Therac-20, fuses and breakers on the machine tripped, shutting down the unit. These failures, which had been occurring ever since the center had acquired the machine, might appear three times a week while new students operated the machine and then disappear for months. Borger determined that new students make lots of different types of mistakes and use "creative methods of editing" parameters on the console. Through experimentation, he found that certain editing sequences correlated with blown fuses and determined that the same computer bug (as in the Therac-25 software) was responsible. The physicist notified the FDA, which notified Therac-20 users.[4]

The software error is just a nuisance on the Therac-20 because this machine has independent hardware protective circuits for monitoring the electron-beam scanning. The protective circuits do not allow the beam to turn on, so there is no danger of radiation exposure to a patient. While the Therac-20 relies on mechanical interlocks for monitoring the machine, the Therac-25 relies largely on software.

The software problem. A lesson to be learned from the Therac-25 story is that focusing on particular software bugs is not the way to make a safe system. Virtually all complex software can be made to behave in an unexpected fashion under certain conditions. The basic mistakes here involved poor software-engineering practices and building a machine that relies on the software for safe operation. Furthermore, the particular coding error is not as important as the general unsafe design of the software overall. Examining the part of the code blamed for the Tyler accidents is instructive, however, in showing the overall software design flaws. The following explanation of the problem is from the description AECL provided for the FDA, although we have tried to clarify it somewhat. The description leaves some unanswered questions, but it is the best we can do with the information we have.

As described in the sidebar on Therac-25 software development and design, the treatment monitor task (Treat) controls the various phases of treatment by executing its eight subroutines (see Figure 2). The treatment phase indicator variable (Tphase) is used to determine which subroutine should be executed. Following the execution of a particular subroutine, Treat reschedules itself.

One of Treat's subroutines, called Datent (data entry), communicates with the keyboard handler task (a task that runs concurrently with Treat) via a shared variable (Data-entry completion flag) to determine whether the prescription data has been entered. The keyboard handler recognizes the completion of data entry and changes the Data-entry completion variable to denote this. Once the Data-entry completion variable is set, the Datent subroutine detects the variable's change in status and changes the value of Tphase from 1 (Data Entry) to 3 (Set-Up Test). In this case, the Datent subroutine exits back to Treat's main line, and Treat will reschedule itself and begin execution of the Set-Up Test subroutine. If the Data-entry completion variable has not been set, Datent leaves the value of Tphase unchanged and exits back to Treat's main line. Treat will then reschedule itself, essentially rescheduling the Datent subroutine.
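The hand-off between Treat, Datent, and the keyboard handler described above can be sketched roughly as follows. This is an illustrative Python model only, not the original code: the class and function names are hypothetical, only the two phase numbers named in the text are modeled, and the keyboard handler is reduced to a single assignment.

```python
# Phase numbers taken from the text; the other phases are not modeled.
DATA_ENTRY = 1
SETUP_TEST = 3

class Shared:
    """Variables shared by Treat and the keyboard handler (names are illustrative)."""
    def __init__(self):
        self.tphase = DATA_ENTRY
        self.data_entry_complete = False

def datent(shared):
    """Advance to Set-Up Test only once the keyboard handler has set the flag."""
    if shared.data_entry_complete:
        shared.tphase = SETUP_TEST
    # Otherwise Tphase is left unchanged, so Datent runs again on the next pass.

def setup_test(shared):
    """Placeholder for the Set-Up Test subroutine; its behavior is not modeled."""

def treat_pass(shared):
    """One scheduling pass of Treat: run the subroutine selected by Tphase."""
    dispatch = {DATA_ENTRY: datent, SETUP_TEST: setup_test}
    phase = shared.tphase
    dispatch[phase](shared)
    return phase  # the phase that was executed on this pass

shared = Shared()
trace = [treat_pass(shared)]        # flag not set: Datent runs, Tphase stays 1
shared.data_entry_complete = True   # keyboard handler signals completion
trace.append(treat_pass(shared))    # Datent runs again and sets Tphase to 3
trace.append(treat_pass(shared))    # Set-Up Test runs
print(trace)  # [1, 1, 3]
```

Note that in this model, as in the description, nothing ties the completion flag to where the cursor actually is on the screen.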

The command line at the lower right corner of the screen is the cursor's normal position when the operator has completed all necessary changes to the prescription. Prescription editing is signified by cursor movement off the command line. As the program was originally designed, the Data-entry completion variable by itself is not sufficient since it does not ensure that the cursor is located on the command line. Under the right circumstances, the data-entry phase can be exited before all edit changes are made on the screen.

The keyboard handler parses the mode and energy level specified by the operator and places an encoded result in another shared variable, the 2-byte mode/energy offset (MEOS) variable. The low-order byte of this variable is used by another task (Hand) to set the collimator/turntable to the proper position for the selected mode/energy. The high-order byte of the MEOS variable is used by Datent to set several operating parameters.

Initially, the data-entry process forces the operator to enter the mode and energy, except when the operator selects the photon mode, in which case the energy defaults to 25 MeV. The operator can later edit the mode and energy separately. If the keyboard handler sets the data-entry completion variable before the operator changes the data in MEOS, Datent will not detect the changes in MEOS since it has already exited and will not be reentered. The upper collimator, on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte of MEOS. The software appears to include no checks to detect such an incompatibility.
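A small sketch may help show how the two bytes of MEOS can end up describing different treatments. The encoding below is an assumption made purely for illustration (the source does not give the real MEOS layout or code values), and the concurrent tasks are replaced by a fixed sequence of steps.

```python
# Hypothetical mode/energy codes; the real MEOS encoding is not given in the source.
XRAY_25MEV = 0x01
ELECTRON_22MEV = 0x02

def pack_meos(code):
    """Assumed encoding: the same mode/energy code in both bytes of a 16-bit word."""
    return (code << 8) | code

def high_byte(meos):
    return (meos >> 8) & 0xFF

def low_byte(meos):
    return meos & 0xFF

# 1. The operator enters X-ray mode; the keyboard handler writes MEOS.
meos = pack_meos(XRAY_25MEV)

# 2. Datent runs, reads the high-order byte, and sets the operating parameters.
params_set_for = high_byte(meos)

# 3. Data entry is flagged complete, so Datent exits and is not reentered.
# 4. The operator edits the mode to electron; the keyboard handler rewrites MEOS.
meos = pack_meos(ELECTRON_22MEV)

# 5. Hand, still running, reads the current low-order byte to position the collimator.
collimator_set_for = low_byte(meos)

def consistent(params_code, collimator_code):
    """The kind of cross-check that the software appears not to have included."""
    return params_code == collimator_code

print(consistent(params_set_for, collimator_set_for))  # False
```

In this model the operating parameters still reflect the original X-ray entry while the collimator follows the edited electron entry, which is the inconsistency the paragraph describes.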

The first thing that Datent does when it is entered is to check whether the mode/energy has been set in MEOS. If so, it uses the high-order byte to index into a table of preset operating parameters and places them in the digital-to-analog output table. The contents of this output table are transferred to the digital-to-analog converter during the next clock cycle. Once the parameters are all set, Datent calls the subroutine Magnet, which sets the bending magnets. Figure 3 is a simplified pseudocode description of relevant parts of the software.

Setting the bending magnets takes about 8 seconds. Magnet calls a subroutine called Ptime to introduce a time delay. Since several magnets need to be set, Ptime is entered and exited several times. A flag to indicate that bending magnets are being set is initialized upon entry to the Magnet subroutine and cleared at the end of Ptime. Furthermore, Ptime checks a shared variable, set by the keyboard handler, that indicates the presence of any editing requests. If there are edits, then Ptime clears the bending magnet flag and exits to Magnet, which then exits to Datent. But the edit change variable is checked by Ptime only if the bending magnet flag is set. Since Ptime clears the flag during its first execution, any edits performed during each succeeding pass through Ptime will not be recognized. Thus, an edit change of the mode or energy, although reflected on the operator's screen and in the mode/energy offset variable, will not be sensed by Datent, and Datent will therefore not index the calibration tables appropriate to the edited machine parameters.
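The effect of clearing the flag inside Ptime can be seen in a short model. This is a sketch built from the prose description only: the names, the number of magnets (three), and the way an edit is injected during a given pass are all assumptions, and real timing is replaced by a simple loop.

```python
class State:
    """Shared flags (illustrative names)."""
    def __init__(self):
        self.bending_magnet_flag = False
        self.edit_requested = False  # set by the keyboard handler

def ptime(state):
    """Delay routine as described in the text; returns True if an edit was noticed."""
    noticed = False
    if state.bending_magnet_flag:      # edits are checked only while the flag is set
        if state.edit_requested:
            noticed = True
    state.bending_magnet_flag = False  # the flaw: cleared at the end of every call
    return noticed

def magnet(state, num_magnets, edit_during_pass):
    """Call ptime once per magnet; a simulated edit arrives during the given
    0-based pass. Returns True if the edit was noticed and Magnet exited early."""
    state.bending_magnet_flag = True   # initialized once, on entry to Magnet
    for i in range(num_magnets):
        if i == edit_during_pass:
            state.edit_requested = True  # operator edits while magnets are being set
        if ptime(state):
            return True                  # exit to Datent so the edit can be handled
    return False                         # the edit, if any, went unnoticed

results = [magnet(State(), 3, k) for k in range(3)]
print(results)  # [True, False, False]
```

Only an edit made during the first pass is noticed; edits made during any later pass are missed because the flag guarding the check has already been cleared. In this model, keeping the flag set until Magnet itself finishes would let every pass see the edit request.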