Carl,
Some thoughts on the Rambus fiasco, prompted by the question of why the problems were not detected earlier.
Suppose that there is a timing violation which occurs when the electrical distance between the 820 controller and one of the memory chips is an exact multiple, greater than one, of the clock period (plus perhaps some small fixed constant).
The metastability window in which this causes random errors might only be, say, 10 picoseconds wide, corresponding to a physical distance of 1.5mm.
If only two RIMMs are installed the total delay is always less than 0.13 + 2.1 + 0.07 + 2.1 = 4.4ns based on 2cm track from 820 to first RIMM, 1cm between RIMMs and the maximum propagation delay for a 16 device RIMM.
This delay is always less than two clock periods (5ns at 800Mbaud, rising to 6.66ns at 600Mbaud). Therefore the errors will never happen under any conditions if only the first two slots are used.
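The arithmetic above can be checked with a short Python sketch. The names are illustrative only, the figures are the ones quoted above, and the clock is assumed to run at half the baud rate (400MHz for 800Mbaud).

```python
# Delays in ns, taken from the figures quoted above.
TRACK_TO_FIRST_RIMM = 0.13   # 2cm of track from the 820 to the first RIMM
TRACK_BETWEEN_RIMMS = 0.07   # 1cm of track between RIMMs
RIMM_16_MAX = 2.1            # maximum propagation delay of a 16 device RIMM


def clock_period_ns(mbaud):
    """Clock period in ns, assuming two bits per clock (clock = baud / 2)."""
    return 2000.0 / mbaud


def two_rimm_delay_ns():
    """Worst-case delay to the far end of the RIMM in the second slot."""
    return (TRACK_TO_FIRST_RIMM + RIMM_16_MAX
            + TRACK_BETWEEN_RIMMS + RIMM_16_MAX)


total = two_rimm_delay_ns()
for mbaud in (600, 700, 800):
    limit = 2 * clock_period_ns(mbaud)
    print(f"{mbaud}Mbaud: delay {total:.2f}ns, two periods {limit:.2f}ns, "
          f"below limit: {total < limit}")
```

At all three rates the worst-case delay stays below two clock periods, which is the basis for saying the two-slot case is safe under this hypothesis.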
Now consider the configuration with two end slots used and a continuity module in the middle one.
Assuming the same numbers as before, and the 0.88ns maximum for a continuity module, the delay to the leading edge of the third RIMM is 0.13 + 2.1 + 0.07 + 0.88 + 0.07 = 3.25ns and the delay to the end of the RIMM in the third slot will be 3.25 + 2.1 = 5.35ns. This means that the critical distance comes somewhere within the RIMM in slot 3 for 800Mbaud.
Now suppose that the first slot has a 16 device RIMM and the other slots each have 8 device RIMMs (32 devices is the maximum allowed).
The delay range for the RIMM in the third slot is now 0.13 + 2.1 + 0.07 + 1.5 + 0.07 = 3.87ns to 3.87 + 1.5 = 5.37ns.
Approximately the same result is obtained.
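The two three-slot cases can be compared in a small Python sketch. It computes the delay window spanned by the RIMM in slot 3 and lists which multiples of the clock period (2 or more, per the first hypothesis) fall inside it. Function names are illustrative, and the clock is again assumed to be half the baud rate.

```python
def slot3_window_ns(first_ns, middle_ns, third_ns, lead=0.13, gap=0.07):
    """Delay (ns) to the leading edge and to the far end of the slot 3 RIMM.

    lead: 2cm track from the 820 to slot 1; gap: 1cm track between slots.
    """
    start = lead + first_ns + gap + middle_ns + gap
    return start, start + third_ns


def critical_multiples(window, mbaud, min_multiple=2):
    """Multiples k >= min_multiple with k clock periods inside the window."""
    start, end = window
    period = 2000.0 / mbaud  # ns, assuming clock = baud / 2
    hits = []
    k = min_multiple
    while k * period <= end:
        if k * period >= start:
            hits.append(k)
        k += 1
    return hits


# 16 device RIMM, continuity module, 16 device RIMM
with_continuity = slot3_window_ns(2.1, 0.88, 2.1)
# 16 device RIMM followed by two 8 device RIMMs
three_rimms = slot3_window_ns(2.1, 1.5, 1.5)

for name, window in (("continuity", with_continuity), ("8+8", three_rimms)):
    for mbaud in (600, 700, 800):
        print(name, mbaud, critical_multiples(window, mbaud))
```

With these numbers only the 800Mbaud case puts two clock periods (5ns) inside the slot 3 window; the 700Mbaud case needs about 5.7ns, which is just beyond the far end.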
Hence, according to this hypothesis, for various combinations of RIMMs and continuity modules consistent with press reports there are situations where a timing error might affect the system at 800Mbaud and possibly at 700Mbaud if PCB tracks are slightly longer than I assumed.
How likely is it that such a fault would be detected?
The propagation delay through the third RIMM is 1.5ns to 2.1ns max (for 8-chip and 16-chip RIMMs respectively). Therefore, for RIMMs with worst-case propagation delays, the probability is 8 * 10ps / 1500ps = 5.3% or 16 * 10ps / 2100ps = 7.6% at each clock frequency and combination of different RIMM layouts.
This assumes that if the critical distance does not correspond to any device then no error will occur. These numbers are only approximate.
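The same estimate as a Python sketch, under the assumptions stated above: each device presents one 10ps window, the windows do not overlap, and the critical point is equally likely to fall anywhere along the RIMM. The function name is illustrative.

```python
def detection_probability(devices, rimm_delay_ps, window_ps=10.0):
    """Chance that the critical point lands on some device's window.

    Assumes one non-overlapping window per device and a critical point
    uniformly distributed over the RIMM's propagation delay.
    """
    return devices * window_ps / rimm_delay_ps


for devices, delay_ps in ((8, 1500.0), (16, 2100.0)):
    p = detection_probability(devices, delay_ps)
    print(f"{devices} device RIMM: {100 * p:.1f}%")
```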
(Use of spread spectrum clocking would increase the probability of errors occurring at all, but would greatly reduce their frequency.)
My conclusion from this is that even though a number of combinations of devices may have been tested exhaustively, the probability of failing to find this hypothetical fault is high.
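To put a rough number on this, the sketch below treats each tested configuration as an independent trial with the same per-configuration detection probability. Both the independence and the configuration counts are assumptions made for illustration, not known facts about the actual test programme.

```python
def miss_probability(p_detect, n_configs):
    """Chance that none of n_configs independent configurations shows the fault."""
    return (1.0 - p_detect) ** n_configs


# 5.3% is the 8 device figure from the estimate above; the counts are hypothetical.
for n in (5, 10, 20):
    print(f"{n} configurations: {100 * miss_probability(0.053, n):.0f}% chance of a miss")
```

Even with ten thoroughly tested configurations, the chance of never hitting the window stays above one half under these assumptions.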
Now apply Occam's razor and assume that the fault also occurs at any multiple of the clock period including 1 period.
In this case, the critical distance could fall within the bounds of the second slot, but because there are fewer possible combinations of RIMMs in two slots than three there is a higher probability that the effect has never been detected.
How to test this hypothesis?
Drive the system clock with a VERY slow frequency sweep from 300 to 400MHz over a period of many hours while running memory test software. This would slowly move the critical point along the bus, past the terminals of each device in turn, greatly increasing the probability of detection.
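A rough Python sketch of how slow the sweep needs to be. It works out how far the two-period critical delay moves between 300 and 400MHz, divides that into 10ps windows, and multiplies by a dwell time per window. The 60 second dwell is purely an assumption standing in for however long the memory test needs to exercise each position.

```python
import math


def critical_delay_ns(clock_mhz, multiple=2):
    """Electrical delay (ns) equal to `multiple` clock periods."""
    return multiple * 1000.0 / clock_mhz


def sweep_plan(start_mhz=300.0, stop_mhz=400.0, multiple=2,
               window_ps=10.0, dwell_s=60.0):
    """Number of window-sized steps in the sweep and the total time in hours.

    dwell_s is a hypothetical test time per step, not a measured figure.
    """
    span_ps = 1000.0 * abs(critical_delay_ns(start_mhz, multiple)
                           - critical_delay_ns(stop_mhz, multiple))
    steps = math.ceil(span_ps / window_ps)
    return steps, steps * dwell_s / 3600.0


steps, hours = sweep_plan()
print(f"{steps} steps, about {hours:.1f} hours at 60s per step")
```

The critical delay moves by about 1.67ns over the sweep, so with a 10ps window there are well over a hundred distinct positions to dwell on, which is why the sweep has to take hours rather than minutes.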
Alternatively, make a batch of perhaps 50 continuity modules, each with a slightly (1mm say) longer electrical length than the one before, and use them in turn in the first or second slot.
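A quick Python check on the module batch, using the 10ps to 1.5mm equivalence quoted earlier (0.15mm per ps). The names are illustrative, and the batch size and step are the "perhaps 50" and "1mm say" figures above.

```python
MM_PER_PS = 1.5 / 10.0  # from the 10ps = 1.5mm equivalence quoted earlier


def module_offsets_ps(count=50, step_mm=1.0):
    """Extra delay (ps) of each module relative to the shortest one."""
    return [i * step_mm / MM_PER_PS for i in range(count)]


offsets = module_offsets_ps()
step_ps = offsets[1] - offsets[0]
print(f"step {step_ps:.2f}ps, total range {offsets[-1]:.0f}ps")
```

Each 1mm step is about 6.7ps, which is finer than the assumed 10ps window, so no position is skipped. The whole batch covers only about 0.33ns, though, so it probes a much narrower range than the frequency sweep.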
Just to make it harder, the effect may well not happen for simple memory operations, but only in special cases such as turning the bus round between reads and writes when an extra cycle delay is added in some situations, or maybe when bursts are aborted partway through.
If the first hypothesis is correct, then restricting the i820 to two slots would provide a complete cure.
If the second is correct, then the i820 should be modified to be sure that some as-yet untested RIMM combination will not cause problems in the future.
In no case would applications like the Playstation II or most notebook PCs be affected because they would have much shorter bus lengths than even one clock period.
Note that none of this relates to signal quality on the bus, which can much more readily be verified. It is purely a design issue in the controller logic. It is also purely hypothetical, although I think plausible at this stage.
John