Companies Tap Contingency Plans as AT&T's Frame Relay Network Crashes [Shouldn't this drive more sales of ASND frame relay equipment?]
(04/20/98; 4:33 p.m. ET) By Kate Gerwig, InternetWeek
techweb.com
Network backup, like life insurance, is something you buy, but hope you'll never use. Last week, many AT&T frame relay customers found out what their backup systems were made of.
"This isn't exactly the way we wanted to find out that our backup systems are excellent," said a MasterCard International spokesman after his company recovered from the nationwide failure of AT&T's InterSpan frame relay network.
After MasterCard got over the initial shock from a network failure that stranded 23,000 member financial institutions around the world, it got some good news. "Investing lots of money in multiple backup systems was exactly the right thing to do," the MasterCard spokesman said, summing up the 24-hour disaster that left an estimated 6,600 AT&T frame relay customers firing traffic into a dead network.
MasterCard relies on AT&T's frame relay network to process $600 billion in credit card transactions each year -- or $1.64 billion a day. It is only one of thousands of business customers that rely on AT&T's data network for mission-critical applications.
MasterCard was luckier than many others.
Wells Fargo Bank, which uses AT&T frame service in California to connect 1,070 branches, found that one-half were without working automated teller machines, phones, and computer terminals after the failure. The 500 branches that rely on an AT&T-provided ISDN backup system continued working, although slowly. While Wells Fargo worked with AT&T to bring its network back to life, it also called in MCI and Roseville Telephone to install 12 T1s to restore service in 350 of its California sites.
Some new AT&T frame relay customers were left pretty rattled by the experience. A publishing company that chose AT&T less than a month ago said AT&T responded to the outage quickly, but there were problems switching to the AT&T-provided ISDN backup service.
"The ISDN connections were unavailable almost as long as the frame network was down," said the publishing company's network manager, who asked not to be identified. "We won't move away from frame right now. We were happy before the crash. But we'll keep an eye on new technologies that might be more reliable."
The AT&T crash may lead some businesses to use different providers for redundant and backup connections.
"A massive outage like this is highly unusual, but it has bitten a lot of people," said Chris Luise, chief technical officer of Skandia AFS, a worldwide insurance and financial services company, and an MCI user. "Anyone who has a mission-critical network that is running without a backup is flying without a net."
Those safeguards may appear outdated with today's self-healing backbones and Sonet rings. "It's just not a strategy that people subscribe to anymore. It's put all your eggs into one basket and get the lowest price you can," Luise said.
The mysterious 24-hour crash will go down in the record books as one of the biggest and longest.
Michael Armstrong, AT&T's CEO, publicly apologized and dispatched service reps to work at customer sites night and day. He also sent hand-delivered letters to the CEOs of all of AT&T's frame customers the morning after the failure. Customers with heavy network dependence were given 15-minute updates on network status, Armstrong said.
"It's been a very difficult 20 hours for our customers," Armstrong said last Tuesday. "We let our customers down, and I want to apologize to each and every one of them. We will apply every resource we can to fix this problem."
Frank Ianna, AT&T executive vice president of network and computing services, said the problem started with two of the 145 StrataCom BPX 8600 switches in the AT&T frame network. StrataCom is now owned by Cisco.
AT&T confirmed that the problem switches are in Cambridge, Mass., and Albany, N.Y.
"The problem was probably from some type of hardware, software, or a combination of the two," Ianna said. "It began to spread through the network and impact virtually every node on the network."
Ianna admitted that StrataView, the BPX network management system used by AT&T, did not alert operations personnel to the problem.
"This is the first time [the InterSpan network has] ever gone down," said Rick Malone, a principal at Vertical Systems. "Certain industries live and die on these networks. They have airline networks that have thousands of sites. If all of these sites are down, it's panic time."
No matter what the cause, Malone gives AT&T high marks for bringing 100,000 ports back up in 24 hours. "This is the service provider's total nightmare -- a network that shuts down by itself. This was a fatal problem, and they brought it back up."
AT&T has such faith in its frame relay network that only in January did it begin offering new service-level agreements (SLAs) that guarantee 99.99 percent network availability. Customers who have signed on since January receive credits commensurate with their network size if the network goes down. If a customer's permanent virtual circuits (PVCs) are lost and are not restored within four hours, the customer gets the affected ports and PVCs free for a month.
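To put the 99.99 percent figure in perspective, the short sketch below converts an availability target into the downtime it allows. The article does not say over what period AT&T measures availability, so the yearly and 30-day windows here are assumptions, and the helper name is purely illustrative.

```python
# Illustrative arithmetic only; the measurement periods are assumptions,
# not terms taken from AT&T's SLA.

def allowed_downtime_minutes(availability_percent: float, period_hours: float) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    return (1 - availability_percent / 100) * period_hours * 60

HOURS_PER_YEAR = 365 * 24    # assumed non-leap year
HOURS_PER_MONTH = 30 * 24    # assumed 30-day billing month

yearly = allowed_downtime_minutes(99.99, HOURS_PER_YEAR)
monthly = allowed_downtime_minutes(99.99, HOURS_PER_MONTH)
outage_minutes = 24 * 60     # the roughly 24-hour outage described in the article

print(f"Allowed downtime per year:  {yearly:.2f} minutes")
print(f"Allowed downtime per month: {monthly:.2f} minutes")
print(f"Outage as a multiple of the yearly allowance: {outage_minutes / yearly:.1f}x")
```

Under those assumptions, 99.99 percent availability leaves room for under an hour of downtime a year, so a single 24-hour failure uses up many years' worth of that allowance.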
Many frame relay customers are not covered by the new SLAs, however, and have negotiated individual SLAs with AT&T. Regardless of the SLA in effect, Armstrong said, the company will not charge customers for frame relay service until it fixes the problem and can provide better guarantees that it will not happen again.
"Frankly, AT&T has one of the best SLAs in the industry right now," said TeleChoice analyst Christine Heckart.
"The cause of the problem will probably be resolved by the end of the month, but there's no way to guarantee that. For Armstrong to say they weren't going to charge customers until it is resolved is not necessarily low risk," she said.
Heckart said AT&T's handling of the outage sets a new benchmark in the industry for crisis management and has customers giving AT&T high marks for responsiveness.
"Every customer knows they will work with a carrier that will have problems. So what they have to look for, is how the carrier responds during the problem," she said.