AT&T IDENTIFIES ROOT CAUSE OF FRAME RELAY NETWORK MELTDOWN, PROMISES FIX [BROADBAND NETWORKING NEWS 04/28]
A software update procedure that AT&T [T] technicians performed on a line card within a frame relay switch combined with a software flaw in the line card and a separate bug in the switch software to bring down AT&T's frame relay network. The April 13 network outage - shortly after BROADBAND NETWORKING NEWS went to press - left 6,600 AT&T customers without frame relay service for 24 hours or more. AT&T immediately went into damage control. In a letter, and at a press conference on April 14, AT&T Chairman C. Michael Armstrong apologized to AT&T's customers and announced that the carrier would forgo charges for frame relay service until it isolated and confirmed the root cause of the problem.
After a week of official silence, Armstrong retook the podium last week to state that AT&T had definitively identified the root cause and knew how to fix it. The outage has been isolated to a pair of software flaws within the Cisco Systems [CSCO] Stratacom frame relay product, which were triggered by a faulty software update. AT&T had little to say about what its technicians were doing, or why they chose a busy Monday afternoon, other than to refer repeatedly to "a procedure" that Frank Ianna, network services executive, described as "inadequate."
The problem began when the procedure triggered a flaw in the circuit card's internal software, which began generating a stream of false messages. Unfortunately, a separate problem in the frame relay switch software left that switch and all 145 other frame relay switches unable to recognize the messages as garbage. "One particular procedure in one switch, coupled with that software flaw, started the looping of messages," said Ianna. "Then the switch was not able to recognize that it was sending out this storm of messages to the other switch, and hence all the switches became overloaded," he added.
...Repairs Now Underway
Ianna reported that AT&T has in hand the firmware for the circuit card to prevent a recurrence, and that it was working with Cisco Systems to identify the final changes needed to the frame relay switch software. Cisco Systems did not participate in the press conference. Ironically, the switch that started the "message storm" was off-line at the time of the incident, but was still interconnected on a trunk basis, so the network control messages did go through. AT&T performed its update procedure in the middle of a busy afternoon, rather than at night, because it believed any potential failure would remain confined to one trunk card carrying no customer traffic.
Armstrong said that AT&T was "probably a couple of days" from a finalized solution to the problem, which it would share in detail with its customers. At that time, it would reinstitute network charges. He did not offer any estimates of the cost of the experience to AT&T. "It will not require a charge against earnings. It will be very immaterial," he said.
Vertical Systems Group reports that AT&T has a 39.3 percent share of the U.S. frame relay services market, a market worth $2.54 billion last year and estimated to grow to $4.02 billion in 1998. At AT&T's share of a $4 billion-a-year market, the outage could cost the carrier about $4 million a day until it can resume charging its customers. Armstrong would say that the figure is less than such a worst-case scenario due to the ramp rate on the revenue. "The [frame relay] network, depending upon our revenue growth going forward, is somewhere around a $900 million to a billion dollar business, and we have had a several week outage," he offered. (Adele Ambrose, AT&T, 908/221-6900)
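The revenue figures quoted above can be checked with a quick back-of-the-envelope calculation. The sketch below is illustrative only: it assumes revenue accrues evenly over a 365-day year and that AT&T's 39.3 percent share applies directly to the 1998 market estimate, neither of which the article states. The function and variable names are hypothetical.

```python
# Figures quoted in the article (Vertical Systems Group and Armstrong).
MARKET_1998_USD = 4.02e9      # estimated 1998 U.S. frame relay market
ATT_SHARE = 0.393             # AT&T's reported market share
ARMSTRONG_LOW_USD = 900e6     # low end of Armstrong's revenue range
ARMSTRONG_HIGH_USD = 1e9      # high end of Armstrong's revenue range


def daily_revenue(annual_revenue_usd, days_per_year=365):
    """Average revenue per day, ASSUMING even accrual over the year."""
    return annual_revenue_usd / days_per_year


# Worst case: AT&T's share of the full 1998 market estimate.
worst_case_per_day = daily_revenue(MARKET_1998_USD * ATT_SHARE)

# Armstrong's own range for the size of the business.
low_per_day = daily_revenue(ARMSTRONG_LOW_USD)
high_per_day = daily_revenue(ARMSTRONG_HIGH_USD)

print(f"Worst case:        ${worst_case_per_day / 1e6:.2f} million per day")
print(f"Armstrong's range: ${low_per_day / 1e6:.2f} to "
      f"${high_per_day / 1e6:.2f} million per day")
```

Under these assumptions the worst case comes to a little over $4 million a day, while Armstrong's own range implies roughly $2.5 million to $2.7 million a day in waived charges.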