Carl,
No, it isn't some logic problem in the chipset. Those are the easiest problems in the world to catch; they don't wait until the last minute to show up. I've designed and debugged memory controllers for 18 years, and I hereby stake my reputation on the fact that this is not a simple logic problem.
I don't know - we'll see.
A possible cause (in the controller) that I would believe is something to do with the inherently long delays to the last few chips on a Rambus channel. When data gets transferred between two clock domains, there is always the likelihood of screwing the design up. It is possible that their clock domain transfer logic doesn't work when the two clocks are too widely different.
The clock travels together with the data, so I don't think the latching of data is the problem. The setup and hold times are 0.2ns each for Samsung 128/144 devices. The cycle time for 800Mbaud RDRAM is 2.5ns, so each bit occupies a 1.25ns time window. As you point out at the end, metastability is crucially important and is always present, even if the probability of error is vanishingly small. Provided the setup and hold conditions are met, the device will meet specifications. The exponential relationship means that adding even a few picoseconds extra will increase the reliability by a large amount. Conversely, failing to meet the setup time will cause random errors whose probability is exponentially related to the amount by which the specification is missed and to the inherent speed of the data latch.
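To illustrate the exponential relationship, here is a rough sketch using the textbook synchroniser model, in which the chance that a latch is still undecided falls off as exp(-t/tau). The time constant tau_ps, its value and the function name are my assumptions for illustration; none of them comes from the Samsung data sheet.

```python
import math


def relative_error_rate(extra_margin_ps, tau_ps):
    """Error rate relative to the zero-margin case, modelled as exp(-t/tau).

    extra_margin_ps > 0 means extra settling time beyond the specification;
    extra_margin_ps < 0 means the specification is missed by that amount.
    tau_ps is the latch's resolution time constant (hypothetical here).
    """
    return math.exp(-extra_margin_ps / tau_ps)


if __name__ == "__main__":
    tau_ps = 5.0  # assumed value, not taken from any data sheet
    for margin_ps in (-10, -5, 0, 5, 10, 20):
        rate = relative_error_rate(margin_ps, tau_ps)
        print(f"{margin_ps:+4d} ps -> {rate:.3g}x the zero-margin error rate")
```

In this model a smaller tau (a faster latch) makes the curve steeper, so the same extra margin buys a larger reduction in errors.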
The setup time is the length of time before a clock edge that the data must be stable. The hold time is the time after that clock edge that the data must remain stable before it is allowed to change. The sum of these values defines a window during which data must be stable for it to be recorded accurately.
An assumption that is often made is that the time needed for latching data in DDR devices is the same as for DRDRAMs. This would imply that the faster DRDRAM devices are inherently less reliable because less timing margin is available.
This is not true. Consider the IBM 256 Mbit DDR RAM you have mentioned previously. The setup and hold times are each 0.075 * TCK, or 0.525ns for the fastest grade device at a clock frequency of 143MHz. So the time window in which data must be stable is 1.05ns. Compare this with the 0.2 + 0.2 = 0.4ns time window needed by the older generation Samsung DRDRAM. The older DRDRAM actually latches data about 2.6 times faster than the latest DDR RAM. Because the exponential metastability relationship will be much steeper for the faster DRDRAM data latches, less extra time is needed to provide an equivalent safety margin.
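The arithmetic above can be checked with a few lines of Python. The figures are the ones quoted in this post; the variable names are mine, and the bit times assume two transfers per clock cycle for both device types.

```python
def stable_window_ns(setup_ns, hold_ns):
    """Total time around the clock edge during which data must not change."""
    return setup_ns + hold_ns


# DDR: setup and hold are each 0.075 * tCK at a 143 MHz clock.
ddr_tck_ns = 1000.0 / 143.0          # clock period, about 7 ns
ddr_window_ns = stable_window_ns(0.075 * ddr_tck_ns, 0.075 * ddr_tck_ns)
ddr_bit_ns = ddr_tck_ns / 2          # two transfers per clock cycle

# DRDRAM: 0.2 ns setup and hold, 2.5 ns cycle, two bits per cycle.
rdram_window_ns = stable_window_ns(0.2, 0.2)
rdram_bit_ns = 2.5 / 2

ratio = ddr_window_ns / rdram_window_ns

print(f"DDR    window {ddr_window_ns:.2f} ns out of a {ddr_bit_ns:.2f} ns bit time")
print(f"DRDRAM window {rdram_window_ns:.2f} ns out of a {rdram_bit_ns:.2f} ns bit time")
print(f"DDR window is {ratio:.1f} times the DRDRAM window")
```

The comparison only covers the latch windows themselves; it leaves out skew, jitter and ringing, which eat into the remaining bit time.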
Furthermore, a substantial part of the available time window for DDR devices is going to be used up waiting for ringing of the bus signals to die down, even if an attempt is made to terminate the lines as well as possible. This is because of the inevitable presence of stubs in designs using DIMMs for multiple chips. Rambus, with a well-controlled transmission line impedance and no long stubs, largely avoids this wasted time. The use of constant current drivers by Rambus means that even the device driving the bus does not significantly alter its impedance when it is turned on, further enhancing the cleanliness of the signals. An additional difficulty faced by DDR RAM is that different bus signals are loaded by different numbers of device inputs. Additional components are therefore needed to equalise the bus loading and avoid timing skew between the different signals. With Rambus all signals have equal loads on them, which avoids this problem.
Suppose a connector ended up with a bit of filth that happened to cause an increase in resistance of 5 ohms. That will not affect the PC100 system; in fact, it may run even better. But it will completely throw the Rambus system into the weeds. The reason for this is that the Rambus system has to be impedance matched, while the PC100 system does not. Basically, traditional, boring, slow, uninteresting signal logic levels are reliable, profitable, and robust. The Rambus logic system is unreliable, expensive, and sensitive.
You have chosen a resistance for your example which might have that effect. However, the contact resistance of dirty connectors normally increases very abruptly after it has reached a few tenths of an ohm, so this difference would not be significant. An alternative viewpoint is that it is better to know quickly that the system is faulty, so that the connectors can be cleaned, than to have some memory locations unreliable while others work perfectly, possibly with no symptoms other than silent data corruption.
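To get a feel for the sizes involved, here is a rough sketch of the reflection caused by a lumped series resistance in an otherwise matched line, using the standard result gamma = R / (R + 2*Z0). The 28 ohm channel impedance is my assumption for illustration, as is treating a dirty contact as a pure series resistance.

```python
def series_reflection(r_ohms, z0_ohms):
    """Reflection coefficient of a series resistance in a matched line.

    The incident wave sees r_ohms in series with the line beyond it (z0_ohms),
    so gamma = (r + z0 - z0) / (r + z0 + z0) = r / (r + 2 * z0).
    """
    return r_ohms / (r_ohms + 2.0 * z0_ohms)


Z0_OHMS = 28.0  # assumed channel impedance, for illustration only

for r in (0.1, 0.3, 5.0):
    gamma = series_reflection(r, Z0_OHMS)
    print(f"{r:4.1f} ohm in series -> {100 * gamma:.2f}% of the incident voltage reflected")
```

On those assumptions a few tenths of an ohm reflects well under 1% of the incident voltage, while 5 ohms reflects about 8%.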
By the way, I remember a time when many more connectors were gold plated than are now. I don't know why they phased them out over the years; I suppose it must have been because connector wiper technology improved to the point where copper was reliable enough.
Bare copper is never used because it quickly forms a semiconducting oxide coating. (Remember copper oxide rectifiers from before the days of selenium?)
PCB contact fingers are usually nickel-plated copper with a thin overcoat of gold alloy. The nickel provides a hard backing for the gold and prevents diffusion of gold into the copper below. Without this the gold would completely disappear after a few months at moderately high temperatures. The gold itself is alloyed with about 0.25% nickel or cobalt, which greatly increases the hardness and prevents cold welds from forming between mating gold surfaces. Otherwise the contacts would only last a few insertions.
The spring contacts are normally made from an alloy of copper with 2% beryllium. This is supplied to the connector manufacturers in a soft state in which it can easily be shaped. It is then heat treated, which causes the initially very fine beryllium particles to aggregate into larger ones that pin the copper atomic lattice more firmly together and make the spring springy and hard. The final coating is the same as for the PCB fingers.
Sometimes, contacts are coated with a layer of solder (tin+lead). A high contact pressure must then be maintained so that a cold weld forms and keeps atmospheric oxygen and contaminants out. Such contacts tend to be unreliable but are slightly cheaper than gold plated ones. The two types should never be mixed because gold and tin react to form a brittle and high resistance interface.
To put it another way, since we are creatures built of blood, mucus and feces, and since we live very short lives of abject poverty and terror (compared to God), we don't have to build equipment that has the crystalline perfection of mathematics. In fact, we can't, and if we could, we'd just break it with our clumsy fingers and dirty flakes of skin that constantly flake off of us...
Don't we come rather close in the form of silicon wafers which may only have a few dislocation defects over the whole wafer area?
John