Technology Stocks : Rambus (RMBS) - Eagle or Penguin
To: Bilow who wrote (30849)
From: John Walliker
Date: 9/28/1999 12:07:00 PM
 
Carl,

Some thoughts on the Rambus fiasco, prompted by the question of why the
problems were not detected earlier.

Suppose that there is a timing violation which occurs when the electrical
distance between the 820 controller and one of the memory chips is an exact
multiple of the clock period greater than 1 (plus some small fixed constant
perhaps).

The metastability window in which this causes random errors might only be,
say, 10 picoseconds wide, corresponding to a physical distance of 1.5mm.
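
As a rough check on that conversion, here is a minimal sketch. The propagation speed is an assumption, inferred from the 2cm = 0.13ns track figure used in the estimates that follow, and the function name is hypothetical.

```python
# Assumption: signal speed inferred from 2 cm of track taking 0.13 ns.
SPEED_MM_PER_NS = 20.0 / 0.13  # roughly 154 mm/ns

def window_length_mm(window_ps, speed_mm_per_ns=SPEED_MM_PER_NS):
    """Physical length along the bus covered by a timing window."""
    return (window_ps / 1000.0) * speed_mm_per_ns

print(round(window_length_mm(10.0), 2))  # about 1.5 mm
```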

If only two RIMMs are installed the total delay is always less than
0.13 + 2.1 + 0.07 + 2.1 = 4.4ns
based on 2cm track from 820 to first RIMM, 1cm between RIMMs and the maximum
propagation delay for a 16 device RIMM.

This delay is always less than two clock periods (5ns at 800Mbaud to 6.67ns
at 600Mbaud). Therefore the errors can never occur under any conditions
if only the first two slots are used.
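
The two-slot budget above can be sketched as follows. The delay figures are the ones quoted in the estimate; the variable names are hypothetical, and the code assumes two transfers per clock (800Mbaud on a 400MHz clock), which is what the 5ns figure implies.

```python
# Delays in ns, taken from the estimate above.
TRACK_TO_FIRST_RIMM = 0.13   # 2 cm of track from the 820
RIMM_16_DEVICE_MAX = 2.1     # maximum delay through a 16 device RIMM
TRACK_BETWEEN_RIMMS = 0.07   # 1 cm of track between slots

def clock_period_ns(mbaud):
    # Assumption: two transfers per clock, so 800 Mbaud means a 400 MHz clock.
    return 1000.0 / (mbaud / 2.0)

two_rimm_delay = (TRACK_TO_FIRST_RIMM + RIMM_16_DEVICE_MAX
                  + TRACK_BETWEEN_RIMMS + RIMM_16_DEVICE_MAX)

for mbaud in (800, 600):
    limit = 2 * clock_period_ns(mbaud)
    # Prints the rate, the worst-case delay, the two-period limit,
    # and whether the delay stays below that limit.
    print(mbaud, round(two_rimm_delay, 2), round(limit, 2),
          two_rimm_delay < limit)
```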

Now consider the configuration with two end slots used and a continuity
module in the middle one.

Assuming the same numbers as before, and the 0.88ns maximum for a
continuity module, the delay to the leading edge of the third RIMM is
0.13 + 2.1 + 0.07 + 0.88 + 0.07 = 3.25ns
and the delay to the end of the RIMM in the third slot will be
3.25 + 2.1 = 5.35ns.
This means that for 800Mbaud the critical distance (two clock periods, 5ns)
comes somewhere within the RIMM in slot 3.

Now suppose that the first slot has a 16 device RIMM and the other slots each
have 8 device RIMMs (32 devices is the maximum allowed).

The delay range for the RIMM in the third slot is now
0.13 + 2.1 + 0.07 + 1.5 + 0.07 = 3.87ns to
3.87 + 1.5 = 5.37ns.

Approximately the same result is obtained.

Hence, according to this hypothesis, for various combinations
of RIMMs and continuity modules consistent with press reports
there are situations where a timing error might affect the system at
800Mbaud and possibly at 700Mbaud if PCB tracks are slightly longer
than I assumed.
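
The two three-slot cases above can be checked with a short sketch. All delays are the figures quoted in the post; the function names are hypothetical, and the two-transfers-per-clock relation is assumed as before.

```python
TRACK_TO_FIRST_RIMM = 0.13   # ns, 2 cm from the 820 to slot 1
TRACK_BETWEEN_SLOTS = 0.07   # ns, 1 cm between adjacent slots

def slot3_window(slot1_ns, slot2_ns, slot3_ns):
    """Return (start, end) delay in ns across the module in slot 3."""
    start = (TRACK_TO_FIRST_RIMM + slot1_ns + TRACK_BETWEEN_SLOTS
             + slot2_ns + TRACK_BETWEEN_SLOTS)
    return start, start + slot3_ns

def two_periods_ns(mbaud):
    # Assumption: two transfers per clock (800 Mbaud = 400 MHz clock).
    return 2 * 1000.0 / (mbaud / 2.0)

configs = {
    "16-device, continuity module, 16-device": (2.1, 0.88, 2.1),
    "16-device, 8-device, 8-device": (2.1, 1.5, 1.5),
}

for name, delays in configs.items():
    start, end = slot3_window(*delays)
    for mbaud in (800, 700, 600):
        critical = two_periods_ns(mbaud)
        inside = start <= critical <= end
        print(f"{name}: {mbaud}Mbaud critical={critical:.2f}ns "
              f"window={start:.2f}-{end:.2f}ns inside={inside}")
```

With these numbers only the 800Mbaud case lands inside the slot 3 window; at 700Mbaud the critical point sits roughly 0.35ns past the end of it.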

How likely is it that such a fault would be detected?

The maximum propagation delay through the third RIMM is 1.5ns for an 8 chip
RIMM and 2.1ns for a 16 chip RIMM. Therefore, for RIMMs with worst case
propagation delays, the probability that the critical point coincides
with a device is
8 * 10ps / 1500ps = 5.3%
or
16 * 10ps / 2100ps = 7.6%
at each clock frequency and combination of different RIMM layouts.

This assumes that if the critical distance does not correspond to any device
then no error will occur. These numbers are only approximate.
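
The estimate above amounts to the fraction of the RIMM's delay that is covered by the devices' windows. A minimal sketch, with a hypothetical function name and the same simplifying assumption that windows do not overlap:

```python
def hit_probability(devices, window_ps, rimm_delay_ps):
    """Chance that the critical point falls in some device's window."""
    return devices * window_ps / rimm_delay_ps

for devices, delay_ps in ((8, 1500.0), (16, 2100.0)):
    p = hit_probability(devices, 10.0, delay_ps)
    print(f"{devices} devices: {100 * p:.1f}%")
```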

(Use of spread spectrum clocking would increase the probability of errors
occurring at all, but would greatly reduce their frequency.)

My conclusion from this is that even though a number of combinations of devices
may have been tested exhaustively, the probability of failing to find this
hypothetical fault is high.

Now apply Occam's razor and assume that the fault also occurs at any multiple
of the clock period including 1 period.

In this case, the critical distance could fall within the bounds of the
second slot, but because there are fewer possible combinations of RIMMs in
two slots than three there is a higher probability that the effect has never
been detected.

How to test this hypothesis?

Drive the system clock with a VERY slow frequency sweep from 300 to 400MHz over
a period of many hours while running memory test software. This would slowly
move the critical point along the bus, past the terminals of each device in
turn, greatly increasing the probability of detection.
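
To get a feel for how slow the sweep needs to be, the sketch below computes where the two-period critical point sits at a given clock frequency and how far it moves per MHz. The 10ps window is the hypothetical figure from earlier in this post; treating the sweep as discrete steps, and the function names, are assumptions made for illustration.

```python
def critical_delay_ns(clock_mhz, periods=2):
    """Delay from the controller at which the critical point sits."""
    return periods * 1000.0 / clock_mhz

def max_step_mhz(clock_mhz, window_ps=10.0, periods=2):
    """Largest clock step that moves the critical point by one window."""
    # The delay changes by periods * 1e6 / f^2 picoseconds per MHz.
    slope_ps_per_mhz = periods * 1.0e6 / clock_mhz ** 2
    return window_ps / slope_ps_per_mhz

for clock_mhz in (300.0, 350.0, 400.0):
    print(f"{clock_mhz:.0f}MHz: critical point at "
          f"{critical_delay_ns(clock_mhz):.2f}ns, "
          f"step under {max_step_mhz(clock_mhz):.2f}MHz")
```

Under these assumptions the clock would have to move in steps of well under 1MHz, with enough test time at each step, for a 10ps window not to be skipped.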

Alternatively, make a batch of perhaps 50 continuity modules, each with a
slightly (1mm say) longer electrical length than the previous one, and use
them in turn in the first or second slot.
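
A quick sketch of what such a batch would cover, assuming the 1cm = 0.07ns track figure used earlier in this post also applies to the extra length on each module (the function name is hypothetical):

```python
NS_PER_CM = 0.07  # assumption: same figure as the 1 cm inter-slot track

def module_offsets_ps(count=50, step_mm=1.0):
    """Extra delay in ps added by each module in the batch."""
    step_ps = step_mm / 10.0 * NS_PER_CM * 1000.0
    return [i * step_ps for i in range(count)]

offsets = module_offsets_ps()
print(f"step {offsets[1]:.1f}ps, total span {offsets[-1]:.1f}ps")
```

On these figures each module shifts the bus by about 7ps, which is finer than the 10ps window, and the whole batch spans roughly 0.34ns.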

Just to make it harder, the effect may well not happen for simple memory
operations, but only in special cases such as turning the bus round between
reads and writes when an extra cycle delay is added in some situations, or maybe
when bursts are aborted partway through.

If the first hypothesis is correct, then restricting the i820 to two slots
would provide a complete cure.

If the second is correct, then the i820 should be modified to be sure that some
as-yet untested RIMM combination will not cause problems in the future.

In no case would applications like the Playstation II or most notebook PCs
be affected because they would have much shorter bus lengths than even one
clock period.

Note that none of this relates to signal quality on the bus, which can much
more readily be verified. It is purely a design issue in the controller
logic. It is also purely hypothetical, although I think plausible
at this stage.

John
