To: Bilow who wrote (31037) 9/28/1999 2:13:00 PM From: John Walliker
Carl,

You missed my point. Your metastability calculations were done under the assumption that only a metastability hit would cause a system error. (I'm not sure that the calculations are correct even assuming this.) But that is not the assumption you should make. Besides, the calculations for probability of detection should be adjusted. Typically, test machines are run for hours, so even if there is only a 0.000000000001% chance of a problem, the problem will very likely be detected if it is being looked for on a 16-bit wide bus at a rate of nearly 800 million opportunities per second.

The point that I was trying to make is that there could be a timing violation that only happens when the bus propagation delay is an exact multiple of the clock period (say). You yourself suggested something like this might happen in an earlier post. If the position on the bus where the delay matches this condition falls half way between two chips, then no matter how long you soak test the system, the error will NEVER be detected. Even if you test the system with a number of motherboard designs, a few combinations of RIMMs and a few fixed clock speeds, you have a significant probability of missing the problem, no matter how thoroughly and exhaustively each combination is tested.

Then somebody comes along and tests a new combination of RIMMs on their new motherboard (Compaq? Dell?) and finds that problems have occurred. They tell Intel and Rambus, who are completely unable to replicate the problem until that system is taken to their labs. The problem occurs because the low probability event of the metastability window coinciding with the delay between the controller and a memory chip has finally happened. The narrower that time window, the less likely it is for the fault to be detected. It does not matter whether my probability calculation is very accurate - the nearest order of magnitude is good enough for this argument to stand.
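The argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a model of any real Rambus system: the fault is assumed to be deterministic and to fire only when the bus delay falls within a narrow window around an exact multiple of the clock period. The delay, window width, clock periods and the function names are all hypothetical values chosen for illustration.

```python
def fault_triggers(delay_ps, period_ps, window_ps):
    """True if the delay lies within window_ps/2 of an exact multiple of the period.

    All times are integer picoseconds (hypothetical values, not measured data).
    """
    phase = delay_ps % period_ps
    # Distance from the delay to the nearest multiple of the clock period.
    distance = min(phase, period_ps - phase)
    return 2 * distance <= window_ps


def failing_periods(delay_ps, periods_ps, window_ps):
    """Return the clock periods at which this toy fault would show up."""
    return [t for t in periods_ps if fault_triggers(delay_ps, t, window_ps)]


DELAY_PS = 3107    # assumed controller-to-chip delay for one particular board
WINDOW_PS = 2      # assumed width of the critical timing window

fixed_speeds = [2500, 2813, 3750]        # a few fixed clock periods
coarse_sweep = range(2500, 3751, 50)     # sweep in 50 ps steps
fine_sweep = range(2500, 3751, 1)        # very slow sweep in 1 ps steps

fixed_hits = failing_periods(DELAY_PS, fixed_speeds, WINDOW_PS)
coarse_hits = failing_periods(DELAY_PS, coarse_sweep, WINDOW_PS)
fine_hits = failing_periods(DELAY_PS, fine_sweep, WINDOW_PS)

print("fixed speeds:", fixed_hits)
print("coarse sweep:", coarse_hits)
print("fine sweep:  ", fine_hits)
```

In this model the length of the soak test does not appear at all: a configuration either sits inside the window or it does not, so running a missed configuration for longer changes nothing. Only a sweep whose step is no larger than the window is guaranteed to land inside it.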
I believe that a scenario like this is consistent with the published information, such as it is, and with all the parties involved having done their very best. Such a fault would only be detected with certainty by an extremely slow frequency sweep such as I suggested. I don't believe that this is a usual method of testing. I don't believe that Rambus, Intel and many others would miss a simple signal integrity problem when test methods for ensuring signal integrity are so well documented and test equipment for this purpose is available.

I have personal experience of a problem with one of my designs that was in principle similar to this and was not detected until the product had been on the market for about 6 years.

John