June 22, 1998, TechWeb News
Memory Technology
Sees need to protect against 'soft' errors in ICs
IBM moves to guard DRAM from cosmic invaders
By Anthony Cataldo
Fishkill, N.Y. - Somewhere in a distant galaxy, a dying star heaves a brilliant pulse of cosmic radiation, projecting a shower of protons and neutrons that spiral in every direction until, eventually, some of them penetrate our atmosphere and bombard the earth indiscriminately. Some of the particles smash into a bank of servers somewhere, causing the network to screech to a halt and jolting an IT manager out of bed.
Such a scenario may not be as farfetched as it sounds. Cosmic rays, which for several decades have been observed as the cause of "soft" errors in ICs, could become more of a problem for the new breed of workstations and servers with large banks of DRAM modules. As memory densities and die sizes grow, so do the chances of memory failures that lie outside the system designer's control, some say.
IBM Corp., which has studied the soft-error phenomenon for years, is particularly vocal in preaching the need to protect against multiple-bit soft errors, and is even offering a memory module with a special ASIC attached that corrects them.
Though soft errors have barely entered the industry's collective consciousness, they could prove pernicious. At its most forgiving, a multiple-bit memory error may cause an operating system to crash. At its worst, it can add another zero to an employee's paycheck, some experts say.
Until about six years ago, soft errors were less of a concern than "hard" errors, which are caused by contaminants that lodge themselves in the gate oxide of a DRAM cell during manufacturing. Since then, DRAM manufacturers have improved their production techniques and packages so that those so-called alpha particles are no longer considered a problem in modern fabs.
But better manufacturing could not protect against soft errors, which by the late 1970s were being observed in some esoteric circles as flipping the bit stored in a DRAM cell's capacitor.
Two years ago, a team at IBM disclosed the results of a study conducted over more than 10 years, starting in the late 1970s, on the effect of cosmic rays on DRAMs.
They tested about 800 devices in constant read mode at sea level, in mountainous regions and in underground caves.
What they found was that the higher the altitude, the more numerous the soft errors, presumably because there are fewer air molecules at higher altitudes to absorb cosmic rays. Lending credence to their theory, the DRAMs tested underground showed zero soft errors after three months.
One soft error/month
"This clearly indicates that because of cosmic rays, for every 256 Mbytes of memory, you'll get one soft error a month," said Tim Dell, senior design engineer for IBM Microelectronics. "The same phenomenon for hard errors with multiple bits will also come into effect with soft errors."
Error correction has always been a fixture in mainframes, which can't tolerate bit errors, lest their data become corrupted and cause errors in records and other mission-critical data types. Now, with new chip sets such as Intel Corp.'s 440BX that enable up to 1 Gbyte of DRAM per system, OEMs are packing their servers and workstations with as much DRAM as once packed the mainframes of yore.
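As a back-of-the-envelope check, the rate Dell quotes (one soft error a month for every 256 Mbytes) can be scaled to the memory sizes discussed here. The sketch below assumes the soft-error rate grows linearly with installed DRAM, which is an assumption rather than a figure from the IBM study; the function name is illustrative only.

```python
# Rate quoted in the article: one soft error per month per 256 Mbytes.
ERRORS_PER_MONTH_PER_256MB = 1.0


def expected_soft_errors_per_month(memory_mbytes: float) -> float:
    """Scale the quoted rate linearly with installed memory (an assumption)."""
    return ERRORS_PER_MONTH_PER_256MB * memory_mbytes / 256.0


# Desktop-class sizes up to the 1-Gbyte ceiling of the new chip sets.
for size in (64, 128, 256, 1024):
    rate = expected_soft_errors_per_month(size)
    print(f"{size:5d} Mbytes: {rate:.2f} soft errors/month")
```

Under that linear assumption, a fully populated 1-Gbyte system would see roughly four soft errors a month, against one every four months for a 64-Mbyte desktop.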
"We see that at 1 Gbyte, we're at a point where you need the robust error correction that the mainframes had. We're using that as the modeling breakpoint," Dell said.
But he added that single-bit error correction, which is now prevalent in all but the lowest-end servers, will no longer be enough to ensure data reliability. "Ten to 15 years ago, all DRAMs were x1. If one DRAM failed you would only lose 1 bit out of the ECC [error correction code] word, so a single-bit ECC was fine," Dell said. "The market is now moving into x8- and x16-wide devices, so there's a certain percentage of DRAM failure rate that will cause more than 1 bit to go wrong."
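Dell's point about device width can be illustrated with a short sketch. It assumes a 72-bit ECC word (64 data bits plus 8 check bits) and a conventional layout in which each DRAM supplies as many bits to every word as it is wide; both are assumptions for illustration, and the function names are hypothetical.

```python
ECC_WORD_BITS = 72  # assumed: 64 data bits + 8 check bits


def bits_at_risk(device_width: int) -> int:
    """Bits of one ECC word supplied by a single DRAM chip.

    Assumes every chip contributes `device_width` bits to each word,
    so a whole-chip failure can corrupt up to that many bits at once.
    """
    return min(device_width, ECC_WORD_BITS)


def single_bit_ecc_covers_chip_failure(device_width: int) -> bool:
    """True if losing one chip corrupts at most 1 bit per ECC word."""
    return bits_at_risk(device_width) <= 1


for width in (1, 8, 16):
    print(f"x{width}: up to {bits_at_risk(width)} bad bit(s) per word, "
          f"single-bit ECC sufficient: "
          f"{single_bit_ecc_covers_chip_failure(width)}")
```

Only the x1 case keeps a dead chip within what single-bit correction can repair; with x8 or x16 parts, one failed device can damage many bits of the same word.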
Besides widths, soaring device densities make today's DRAMs more susceptible to bit errors. And soon, protocol-based DRAMs such as Direct Rambus and SL-DRAMs, which employ a narrow bus but run at high frequencies, will create more chances of bit errors, either because of the noise induced from their sheer speed or because of the device complexity, observers said.
"The 64 bits come out in bursts of 8, and every single bit in an ECC word can come out of the same chip," Dell said. "When you go to a narrow protocol, you have to add more logic in the device and more circuitry and [higher] potential rates of failure. There's a greater chance that a single defect can cause more than 1 bit to fail."
Indeed, Intel said it is aware of the "chip kill" phenomenon that occurs in the new breed of Direct Rambus DRAMs. Nevertheless, there are performance trade-offs with SDRAMs that make the ultra-high bandwidth Direct RDRAMs an attractive solution.
"You can't recover without a hiccup from a dead device, but once you detect a device that has failed you can go back and configure the memory. But with SDRAMs, you lose parallel access," said Pete MacWilliams, Intel fellow and director of platform architecture for Intel Architecture Labs.
Others agree that the potential for bit failure is a growing problem. "It is true that when a customer does a mean-time-to-failure calculation, the failure rate is directly proportional to the amount of DRAM in the system," said Kevin Kilbuck, technical marketing manager for Toshiba America Electronic Components Inc. (Irvine, Calif.). "It depends not only on the system but on the application.
"In desktop PCs, even though you have 64 or 128 Mbytes of memory, when you get an error you may not notice it or have to reboot. [But] if it's a transaction-sensitive application for a bank, they can't tolerate errors. It's driven by the amount of memory in the system and whether it's a mission-critical application."
Under current setups, designers usually use error correction code in the memory controller supported by memory modules with an extra chip dedicated to ECC. Toshiba, for example, sells ECC-enabled 8-Mbit x 72 dual-in-line memory modules (DIMMs) populated with nine pieces of 8-Mbit x 8 DRAMs, with one of the DRAMs functioning for ECC.
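The arithmetic behind the ninth chip can be checked with a small sketch. The article does not say which code the memory controller uses; the example assumes a Hamming-style code that corrects single-bit errors and detects double-bit errors (SEC-DED), a common choice, and the function name is illustrative.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SEC-DED code over `data_bits` data bits.

    Single-error correction needs the smallest r with
    2**r >= data_bits + r + 1; one extra overall-parity bit adds
    double-error detection.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


data = 64
check = secded_check_bits(data)
print(f"{data} data bits + {check} check bits = {data + check}-bit word")
```

For a 64-bit data path this gives 8 check bits and a 72-bit word, which matches a x72 module built from nine x8 devices.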
If multiple-bit errors become a problem, DRAM vendors could also offer more error-correcting bits by offering x80 DIMMs, Kilbuck said.
Another option is to build ECC bits into the DRAM device itself, but DRAM vendors say most customers scoff at such an approach when told that it will raise the price of the DRAM considerably.
That's one of the reasons IBM argues that providing DIMMs with error-correcting algorithms in a hardwired ASIC is the best way for customers to protect their data. For one, off-the-shelf chip sets today don't have the ability to safeguard against multiple-bit errors, leaving companies with the option of building custom memory controllers with the capability.
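Why conventional ECC falls short on multiple-bit errors can be seen in a toy example. The sketch below uses an extended Hamming (8,4) code as a small stand-in for the wider codes used in real memory controllers. This is an assumption for illustration only, not IBM's algorithm (which the article does not describe), and the function names are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds overall parity; indices 1..7 form Hamming(7,4) with
    parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p and i != p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (data, status): 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of the positions of all set bits
    parity_error = sum(code) % 2   # 1 if an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        code = list(code)
        code[syndrome] ^= 1        # single flip: syndrome names the bad bit
        status = "corrected"
    else:
        return None, "uncorrectable"  # two flips: detected, not repairable
    return [code[3], code[5], code[6], code[7]], status


word = encode([1, 0, 1, 1])

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))

two_flips = list(word)
two_flips[5] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))
```

A single flipped bit is located and repaired, but two flipped bits in the same word can only be flagged, which is the gap a multiple-bit correction scheme is meant to close.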
IBM's new 3.3-V, 168-pin modules with the multiple-bit error-correction feature aren't cheap; on average they will cost about 50 percent more than a standard module. At present IBM is only offering the modules with extended-data-out (EDO) DRAM, though the company is developing modules for SDRAMs and protocol-based DRAMs, such as Direct Rambus, that won't take a performance hit due to their extra error-correction logic.
But for now, IBM claims the modules provide the path of least resistance to combat the threat of soft errors, because they can be plugged directly into existing DIMM slots.
Perhaps IBM's biggest challenge is to convince the industry that soft errors caused by sub-atomic particles from outer space are a real problem to reckon with. Dell, who has written and lectured extensively on the subject, acknowledged that many DRAM producers and customers are reluctant to accept that soft errors pose much of a threat; the errors are often explained away as voltage spikes or are blamed on unstable software. And while DRAM manufacturers have put reliability testing in place to screen for alpha particles, few have any testing methodology to account for soft errors caused by cosmic rays, he noted.
And although customers are starting to request more information about preventing multiple-bit DRAM soft errors, Toshiba's Kilbuck said the issue hasn't yet become a priority.
"They worry about it because ECC is not going to fix it, but then they ask us what the total failure is and whether it's single-bit or multiple-bit. Ninety-nine percent of the time, it's a single-bit error," Kilbuck said. "It doesn't seem like double-bit errors are becoming more of a concern. But maybe IBM has figured out something that we don't know yet."