Technology Stocks : Rambus (RMBS) - Eagle or Penguin


To: Bilow who wrote (30728) 9/26/1999 11:28:00 PM
From: Bilow
 
Hi Self; Here is my prediction for the cause of the difficulty: Camino has worse setup or hold times on the data bus inputs than what was required in the design. If I am right, they are going to have to turn the process, and you are in for a 3 month wait. This is my best guess based on the little hints that have been sent out to industry and that we have all read; I certainly have no inside knowledge of any of this.

Please add "probably", "likely", and other mush words to all of this, because it really isn't anything other than educated (or more precisely, experienced) guesswork. Anyway, here are some semi-random thoughts on what might be wrong...

(A) Since there is some mention of ECC being required, the problem must be on the data bus. A problem sending addressing information to the chip (or a problem on the CMOS interface) could not possibly be corrected with ECC, which only corrects single-bit errors in the data word.
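
To illustrate the point in (A), here is a minimal sketch using a Hamming(7,4) code as a stand-in. Real memory ECC normally uses a wider code over the whole data word, and the function names and the two-word "memory" are made up for the example. A flipped data bit gets located and corrected, but a bad address just returns some other perfectly valid codeword, so the ECC logic sees nothing wrong.

```python
def encode(d):
    """Hamming(7,4): 4 data bits -> 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(c):
    """Return (data bits, error position); position 0 means no error detected."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4
    if pos:
        c[pos - 1] ^= 1  # correct the single flipped bit
    return [c[2], c[4], c[5], c[6]], pos


# Hypothetical two-word memory, each word stored with its check bits.
memory = {0: encode([1, 0, 1, 1]), 1: encode([0, 1, 1, 0])}

# Data bus fault: one bit of word 0 flips in transit. ECC repairs it.
damaged = list(memory[0])
damaged[4] ^= 1
data_fault_result = decode(damaged)

# Address fault: we asked for word 0 but the chip returned word 1.
# The codeword is valid, so ECC reports no error and hands back wrong data.
addr_fault_result = decode(memory[1])
```

In the first case the decoder returns the original data along with the position of the bad bit; in the second it returns the wrong word with a clean bill of health.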

(B) Data bus problems can be either reads or writes (Note 1). On a system, it is usually easy to tell them apart, simply by rereading the bad bit, over and over, with various other things going on in the system. If the bit is always bad, the problem is probably a write. If the bit flips back and forth, it is definitely a read.
It would be great to know which type of problem they are having, but I really can't guess from what I have read so far.
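
The reread test in (B) can be sketched as a few lines of code. This is only an illustration of the rule of thumb: the function name and the idea of collecting the rereads into a list are assumptions for the example, not anything from a real diagnostic tool.

```python
def classify_bit_failure(samples, expected):
    """Classify a failing bit from repeated reads of the same location.

    samples  -- list of 0/1 values obtained by rereading the bit
    expected -- the value that was originally written (0 or 1)
    """
    if not samples:
        raise ValueError("need at least one reread")
    wrong = sum(1 for s in samples if s != expected)
    if wrong == 0:
        return "no error observed"
    if wrong == len(samples):
        # The stored value is consistently wrong, so the bad data most
        # likely went in on the write.
        return "probably a write problem"
    # The memory was not rewritten, yet the value changes between reads,
    # so the read path is at fault.
    return "read problem"
```

For example, `classify_bit_failure([0, 0, 0, 0], expected=1)` gives "probably a write problem", while `classify_bit_failure([1, 0, 1, 1], expected=1)` gives "read problem".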

(C) It has been said that the problem is exacerbated by attempts to address chips at the far end of the Rambus channel. This is in agreement with the suggestion I made, in the post linked to this one, that the problem is probably a race condition between data and clock. The farther the two signals are sent, the more randomness is added to their propagation delay. The randomness is caused by all sorts of things, from slight differences in the noise environment experienced by the two traces to slight differences in capacitance of all those 23 chips between the two ends of the channel. Eventually, given a long enough channel, you get enough random prop-delay difference to end up with a race violation.
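
As a rough sketch of the argument in (C), assume each chip load along the channel adds a small, independent random mismatch between data and clock delay. Independent errors add in quadrature, so the RMS skew grows with the square root of the number of loads. The independence assumption, the function names, and all the picosecond figures below are hypothetical; they are not Rambus or Intel numbers.

```python
import math


def skew_sigma_ps(n_loads, per_load_sigma_ps):
    """RMS data-to-clock skew after n_loads, assuming independent contributions."""
    return per_load_sigma_ps * math.sqrt(n_loads)


def max_loads_within_margin(margin_ps, per_load_sigma_ps, n_sigma=3.0):
    """Largest number of loads whose n_sigma skew still fits inside margin_ps."""
    return math.floor((margin_ps / (n_sigma * per_load_sigma_ps)) ** 2)


# Hypothetical numbers: 10 ps RMS mismatch per load, 120 ps of timing margin.
short_channel = skew_sigma_ps(4, 10.0)    # few chips, small skew
long_channel = skew_sigma_ps(16, 10.0)    # more chips, larger skew
limit = max_loads_within_margin(120.0, 10.0)
```

With these made-up figures the skew doubles going from 4 loads to 16, and the margin runs out at 16 loads. A cap on the total chip count is the natural fix under this model.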

(D) The above is consistent with the restriction to a smaller number of RIMMs, but it would also imply that Intel will put a restriction on the total number of RDRAM chips on those RIMMs. In actual fact, I'm not sure what the largest number of RDRAMs currently available on a single RIMM is. But if that number is eight, then Intel is likely to put out a restriction on that number, not allowing it to go to 16 in a two-RIMM system.

(E) The odd thing is that some reports state that merely "terminating" the third socket will not solve the problem. This is fascinating. It suggests that the channel length is significant. The channel length is going to have two consequences. First, because of small errors in part values, the longer line will have some ringing. You wouldn't think that this could have much of an effect, but maybe it would if the Camino's input buffers are too sensitive to noise. Such a problem could arise if there were problems with the VRef signals. I have had some difficulty with understanding the VRef specs on DDR; nobody ever seems to specify the input impedance, capacitance, or current requirement of those pins...

(F) Evidently, according to part (A), the Camino drivers for control lines are working fine. A difference between those drivers and the data bus drivers is that the data bus drivers have to be able to drive 14 ohms because the data bus is doubly terminated, while the control bus may only have to drive 28 ohms. I haven't bothered to check whether the Rambus Channel requires double termination on the control bus; it may well require it.

(G) But if I had to hazard a guess, I would say that it is likely that the Camino is failing to meet its data input setup or hold times. This would explain why they were able to quickly reduce the problem to the chip set, rather than the memory, and this is consistent with the problem being worse in the outlying RDRAMs, as they have the most random delay movement. The simulations must have shown that the setup and hold requirements were met, but the actual parts are out of spec. This could be due to just about anything, from noise in the reference voltage (which interacts with rise and fall times to create setup and hold violations), to differences in the way the Camino process scales delays through the data path versus delays through the clock path. This is the problem I hinted at in the post this one replies to.

(H) On traditional designs, if the problem turns out to be a hold violation, then it occurs at all frequencies, but if it is a setup violation, then it becomes ameliorated at low enough frequencies. Since I haven't seen any suggestion that the problem can be cured by, for instance, using -800 parts in a 600MHz system, I can only suppose that they don't have a setup violation. But since this is one of those new-fangled DLL type designs, you really can't make this sort of conclusion. DLLs can make setup problems into hold problems and vice-versa. This reminds me of a song that design engineers used to sing at a place I worked. The first lyrics were:

"He couldn't meet his setup,
and he couldn't meet his hold
They forced him into marketing,
cause his stuff was just too bold."
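
For the traditional (non-DLL) case described in (H), the frequency dependence falls straight out of the usual slack arithmetic. The sketch below uses a simplified single-path model with made-up nanosecond figures; the function names and numbers are assumptions for illustration and have nothing to do with actual Camino timing.

```python
def setup_slack_ns(period_ns, data_delay_ns, setup_ns):
    """Setup slack: data must arrive setup_ns before the NEXT clock edge.

    The clock period appears in the formula, so slowing the clock helps.
    """
    return period_ns - data_delay_ns - setup_ns


def hold_slack_ns(data_delay_min_ns, hold_ns):
    """Hold slack: data must stay stable hold_ns after the SAME clock edge.

    The clock period does not appear, so no clock speed fixes a violation.
    """
    return data_delay_min_ns - hold_ns


# Hypothetical path: 9 ns worst-case data delay, 2 ns setup requirement.
fast = setup_slack_ns(10.0, 9.0, 2.0)   # negative: setup violation
slow = setup_slack_ns(20.0, 9.0, 2.0)   # positive: violation goes away

# Hypothetical path: 0.5 ns best-case data delay, 1 ns hold requirement.
hold = hold_slack_ns(0.5, 1.0)          # negative at every clock rate
```

A negative slack means a violation. Stretching the period rescues the setup case but leaves the hold case exactly where it was, which is why running at a lower frequency is a useful diagnostic on designs without a DLL.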

(I) The cure will be a silicon turn, so get ready to wait somewhat longer than a board turn. As far as how long to wait, ask Process Boy. If the problem can be solved in metal, then it isn't terribly long before they have something... I would say that three months is an outside guess for how long this will delay RDRAM. Probably less.

Note on refresh problems:
(1) It is also possible to end up with data bus problems from a fault of the refresh circuitry. These are usually associated with particular bits in memory (because those bits have smaller capacitors than normal, or are farther away from a sense amp, or whatever), and any particular bit always goes to the same state. That is, a bit might tend to go to zero, for instance. The one-way nature of this tendency is because the capacitor always discharges in the same direction. Note that because memory systems invert their data bits internally, you cannot predict whether the bit would tend to zero or one without detailed knowledge of the internals of the chip. On some chips, some rows of bits tend to zero, while other rows tend to one. This typically causes a video image to have a checkerboard pattern on graphics cards that are brought to full power before memory is cleared, by the way. But refresh problems are rare; I have only seen a refresh problem once in 18 years. In that system, we had forgotten completely to perform any refresh cycles at all on half the memory. Despite this sad oversight, the system worked most of the time. (This is what prototypes are for.)

Wow, wish I had a chance to work on the problem, I bet they are sweating pretty good. I love fixing things that are broken.

-- Carl