Hi earlie, I think that the embedded chip story is the essential Y2K issue that will not be solved; very few seem to follow it.
Gary North's Links and Forums: Summary and Comments
Category: Noncompliant_Chips
Date: 1999-04-13 07:24:59
Subject: 25% Systems Failure Rate: The End of the Case for Y2K Optimism
Link: webpal.org
Comment: I am posting only this link today. The author of this report is a former computer science professor and the holder of microprocessor patents. I regard this report as the most significant document that I have posted on this site since I began this project. Click through. Print it out. Read it. Mark it with a highlighter. Read it again. You must come to grips with the information in this report.
It confirms what David Hall has been saying about the necessity of testing embedded systems. But it goes beyond Hall. It shows that it is impossible to test large numbers of them without removing them from the boards in which they are embedded. To test them while they are installed and on line is to risk shutting down whole systems permanently. The problem is the secondary clock, as you will learn. The logic of the chips is like the logic in legacy software: layers of forgotten code, all reaching back to the original starting date of the chip.
In a TV interview with engineers in a chemical plant, the author learned that they estimated a 25% systems failure rate. This is close to the average that the SIM found in 1997: 5% to 50%. Unless he is completely wrong, this means that the typical testing that gets reported by the media is useless and misleading. To turn a clock forward to 2000 and not have it shut down the system is not a valid test for a programmed chip, if this is the only test, which confirms what IBM has said. This means that the world will be flying blind in the early months of 2000. If the 25% systems failure rate holds up, there will be thousands of catastrophic failures. How many explosions at chemical plants will it take for communities to shut them all down? There is no valid case for y2k optimism (the "bump in the road") until this document is refuted.
Until Senator Bennett has this man testify, and half a dozen experts testify to show that he is wrong, the Special Committee on the Year 2000 Technology Problem will be a charade. In short, this is it. This is the document that your personal y2k plans must reflect. I would recommend that you subscribe to his free e-mail report. He shows how at the end of this report.
* * * * * * * * * *
I have REPEATEDLY heard or read statements made by people in VERY HIGH places that their systems do not have a Y2K embedded processor problem because their systems: Don't show a date. Don't involve time. Don't have a time problem. I have even read some Japanese, Chinese and Russian statements that they don't have a problem because they don't even use the same calendar as the US does (because of its predominant US Protestant and Roman Catholic Christianity). These highly placed people are wrong, wrong, wrong. The reason they are WRONG is that they do NOT understand the embedded processor SECONDARY clock issue. . . .
Since I am posting this on a web page, and because others may see this who are not familiar with the source, let me first present my credentials for explaining what I am about to explain. I am a former college professor of computer science, and hold both U.S. and Canadian microprocessor patents. The particular application that this paper discusses is oil refineries. I spent one summer as a NUL Fellow with the Chevron Chemical Company, and was at one time a consultant to the Imperial Oil Company, both times in regard to computer systems. I recently completed a lengthy study and report on natural gas pipelines, published at: webpal.org and it was feedback from that report that prompted this inquiry. In preparation for the high point of this study (as the INTERVIEWER for a TV crew doing an interview at a large Oil Company Refinery Research Facility), I spent five days reviewing all the microprocessor information that I could find in two libraries and on the Internet, and consulting with some other individuals in the field. I felt well prepared to conduct the interview and I would say that at this point I feel that I have a better grasp of the situation than 99% of the other people that I meet.
First of all, let me emphasize that with the SECONDARY clock problem in microprocessors we are not talking about wall calendar time.
And we are NOT talking about wall clock time. . . .
ALL microprocessors contain a PRIMARY first clock that we may call its heartbeat clock. The tick of this clock determines the speed of the processor. Some of these clocks are very fast, ticking millions or billions of times a second. However, these are NOT the clocks about which we are concerned. Many of these processors have either a built-in SECONDARY clock or, as is the more usual case, a SECONDARY clock associated with them in a second chip, whose time and date are maintained by the first clock just as the wheels of old mechanical timepieces were governed by the spring and escape mechanism. The first clock is much like a metronome, or the swinging pendulum on an old grandfather clock. This first clock in the microprocessor is like a drum beat that is keeping all the functions of the microprocessor marching in order. On one beat an instruction from the microprocessor's internal memory is fetched; on the next beat that instruction is executed. The pendulum then swings and it is fetch time, and then again it is execution time. Many times a second the dance goes on during the life of the microprocessor. Among the instructions executed may be that of maintaining a SECONDARY clock. . . .
Many computer programs do not use the SECONDARY clock, and many people did not concern themselves with such things as the file dates on their computer files being incorrect. They may feel that if they were not using the SECONDARY clock, or if the SECONDARY clock setting made no difference to them, then the computer was not using the SECONDARY clock or the SECONDARY clock was not affecting the computer. In very old computers this may have often been true. However, as computers and software progressed the designers made more use of the SECONDARY clocks. . . .
What we are talking about is tiny embedded microprocessors that have been built into all sorts of industrial devices like machine and valve controls.
The rules for the operation of these SECONDARY clocks and the operation of the microprocessor itself were maintained in a special kind of memory called ROM, which stands for Read Only Memory. . . .
With the invention of the microchip the "wiring" was done inside the ROM chip when it was manufactured, in much the same way that a printed circuit board is made. Later, because so many different kinds of microprocessors were being developed a new method of more easily programming the ROMs with the computer's or microprocessor's built-in start up program was developed. (This chip on computer mother boards is called the BIOS, which stands for Basic Input Output System). The new type of ROM was called Programmable Read Only Memory or PROM. . . . Thus the PROM chip could be programmed one time (and one time only). . . . What they needed was a PROM that could be erased and re-recorded. This new Erasable PROM was called an EPROM. . . .
You could think of an EPROM chip as being a little bit like a camera. Inside were these connections sensitive to light that would set them all to one. You set them, as before, with an electric current, but then, if you made a mistake, you could open up to the light a little window on the top of the camera-like chip, expose the interior of the chip to ultraviolet light (a matter of a few seconds to a few minutes depending on the chip and light source), then cover up the little window again (usually with a piece of tape) and start over again. In designing computer ROM memories, this EPROM was the way to go, and once you had the program working perfectly, if you were going to make tens of thousands, then it might be better to send the program to the factory and have it burnt into cheaper ROMs. However, if you were going to use just a few of these chips, before changing the design, or because that is all that you were going to make, it was cheaper to just go ahead and use the EPROM, or to burn the program into the necessary number of PROMs. . . .
The real problem arose because engineers began using microprocessors to build other specialized devices such as PLCs (Programmable Logic Controllers), which they then used to control specialized machine tools and such things as automatic valves. In these they too used these EPROMs to design their systems, and they too sometimes put the EPROM in the system and sometimes instead used Field Programmable ROMs to make a number of the devices. Programming these devices can be very, very long and time-consuming work and oftentimes requires a great amount of skill to do the task efficiently. The way that engineers got around that great expense (which was rather like re-inventing the wheel each time) was to take an existing program that ALMOST did what they wanted to do, and then to slightly modify it.
The companies that sold them the EPROM programming equipment also provided the users' engineers with libraries of programs and information about how they could combine the programs and modify them. This became known as the LADDER concept. You started with a number of steps and then you added another step on the ladder. Someone else wanting to do much of what you had done could then use what you had done and add another step on the ladder. The initial LADDERs were much like Truth Tables. Later, more programming capabilities were added in mnemonic form, and most recently greater levels of abstraction have been obtained by a hybrid relationship between the original LADDERs and programs like Visual BASIC. Engineers came and went in companies, even in the companies that designed the EPROM burning equipment to begin with, and no one really knew, or knows to this day, what is hidden back down the ladder or is in the EPROM or PROM program. . . .
In some of the programs, and chips, designed using this method, we can see that there are specifications for a SECONDARY clock, which we can access if we wish. Even if we don't use it ourselves, we don't know that someone else on some other rung of the ladder or in some earlier version level of the program didn't use it. However, oftentimes the ladder and program have come to us in such a way that documentation of many of the elements used to build them is completely lost, and no one knows whether there is a SECONDARY clock in there or not. And there is no practical way to look and find out.
Along with the SECONDARY clock, when the chip manufacturer built the chip and set it working they set its time as a part of the manufacturing process. The time was set in accordance with some original date of design of the clock as programmed into the original code and updated in accordance with the current time of when and where it was manufactured.
No evil intent was meant; it is just that everything has to have a starting place (and, within limited numbers, an ending place or starting-over place: 00), and that was coincidentally the way some of the chip logic was designed. Now, all this was based on real time, but not necessarily Zulu time (the time at the prime meridian), and because the clocks are not THAT accurate, and some may have later accidentally gotten set back to their starting date, we can't say that they are all going to stop working EXACTLY at midnight, either Greenwich Time or local time. The nature of the coding problem that we have been describing is explained at another independent source. . . .
And there were many standard IC products put out by manufacturers that contained such clocks. Every manufacturer has long lists of such devices and fortunately many of the manufacturers are making this fact known to the public. For an example look at: mot-sps.com
And you may still say, but who cares, because we are not using a time function. And that may be true, that YOU are not using a time function, but the chips I have just described may be. Chips often control things based on intervals, such as checking every so often (perhaps every few milliseconds) whether a train is coming and the gates should be lowered, or whether the train has passed and the gates should be raised. For this purpose a chip may, and very often does, use the SECONDARY internal clock to keep track of how much time has passed and whether or not it should check for the presence of a train. No one knows what logic the engineer designing the gate control used, nor for that matter did the engineer designing the control have any idea what logic the programmers designing the earlier rungs down the Logic Ladder may have used. All he knew is that the instructions said, put in this signal and under these conditions you will get out that signal.
In order to measure a delay, or interval, the logic in the Logic Ladder oftentimes used the difference between the current SECONDARY clock time and an earlier time from the same SECONDARY clock. This all works well and good, without any concern about whatever time people are using out in the real world. That is, it works fine as long as you subtract a lesser time from a greater time. BUT should it ever occur that the system subtracts a greater number from a lesser number, you will get a negative result. This is something which will happen ONE TIME ONLY, and that is when the SECONDARY clock flips over to zero zero for the year 2000. What will happen when this happens is called UNDEFINED. Sometimes UNDEFINED results are not that bad, but oftentimes they are, and Engineers really do not like to be surprised by them. Unfortunately, we don't know exactly how this PROM logic works in many, many devices. It is VERY difficult to test for because there is NO PRACTICAL WAY to go around and set those INTERNAL SECONDARY clocks, even if we can determine that they are there and find them.
The many people who do go around and set forward EXTERNAL clocks to the year 2000 on their systems may STILL be in for a horrible surprise, because setting external clocks often does not affect the SECONDARY internal clock. For some systems changing the EXTERNAL clocks has caused an interaction between them and the internal clock in such a way that the system has failed, but that is not always the case. This is why all the tests that we hear about, the FAA flying planes on which they have set forward the clocks, or setting forward the clocks in the Air Traffic Control Centers, or setting forward the clocks in the Electrical Generating Plants, or setting forward the clocks anywhere else, may have had ABSOLUTELY NO EFFECT on the SECONDARY clock itself.
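The subtraction hazard described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two-digit-year clock layout, the 366-day slot per year, and the names `clock_seconds`, `elapsed` and `should_poll` are hypothetical and do not come from any real chip or PLC library. It only shows how an interval test of the kind described can silently stop firing once the year field wraps from 99 to 00.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 366  # generous slot size so every day of any year gets a unique value


def clock_seconds(yy: int, day_of_year: int, second_of_day: int) -> int:
    """Hypothetical secondary-clock reading, flattened to seconds since year 00."""
    return (yy * DAYS_PER_YEAR + day_of_year) * SECONDS_PER_DAY + second_of_day


def elapsed(earlier: int, later: int) -> int:
    """Naive interval: assumes the later reading is always the larger number."""
    return later - earlier


def should_poll(earlier: int, later: int, interval: int) -> bool:
    """Decide whether at least `interval` seconds have passed since the last check."""
    return elapsed(earlier, later) >= interval


# An ordinary day: two readings taken 20 seconds apart.
normal_gap = elapsed(clock_seconds(99, 200, 0), clock_seconds(99, 200, 20))

# The rollover: 10 seconds before midnight on day 365 of year 99,
# then 10 seconds after midnight on day 1 of year 00.
before = clock_seconds(99, 365, 86_390)
after = clock_seconds(0, 1, 10)
rollover_gap = elapsed(before, after)

print(normal_gap)                      # 20
print(rollover_gap < 0)                # True: a greater number subtracted from a lesser one
print(should_poll(before, after, 15))  # False, although 20 real seconds have passed
```

In Python the result is merely a large negative number, so the poll in this sketch is skipped rather than crashing. What real firmware does with such a value depends on logic we cannot see, which is the sense in which the outcome is UNDEFINED.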
On those occasions that I have heard of people going to the EXTREME EFFORT of changing the SECONDARY internal clock, or the PROM logic that uses it, they have found disastrous results. . . .
Moreover, once they have prompted this result, MANY of them were never able to get the device to ever function again. The only option has been to replace the device. Sometimes they could not replace just the chip, because the old chips or the programming for them were no longer available, and new types of chips would not fit the printed circuit board. They couldn't replace the circuit board with one that would hold the new chip, because such boards don't exist. They often couldn't even replace the PLC (Programmable Logic Controller) with a new PLC, because the new ones were not made to fit into the old cabinet or to interface with the old device. The only choice was to replace the entire valve or switch system.
Discussing this situation with engineers, I cannot tell how many of them REALLY understand the problem. When I search the literature, I do not really find much discussion of it. When I speak with Engineers who have actually solved the problem, they of course understand it, but unfortunately I speak with many more that do not seem to understand it. But the REAL fact is that I cannot find THAT many who will, or who are permitted to, speak with me. . . .
It was at this point that I was ready to begin the interview, and indeed I had written up this presentation to this point and submitted it to the TV producer and to the Head of Engineering at the beginning of the 3 hour visit and TV shoot at the large oil company Refinery Research Facility. The interview is on record, and was done by an Independent TV Producer under contract for an affiliate of ABC. However, I will not identify the oil company here because I was allowed to ask off record as many questions as I wished.
My first question was, "How serious is the problem?" I was surprised by the answer. Approximately 25% of the relevant systems had problems. I had only heard such a high figure before from Westergaard. Usually, the figures given have lain between 1% and 7%. I have always used a median figure of 3%.
So, this certainly increased my already high respect for Westergaard as a source for reliable information.
My second question was, "How many systems are we talking about?" The answer: in a large refinery, several thousand. There are of course tens of thousands, maybe even hundreds of thousands of chips, but the Relevant chips, the ROMs and Microprocessors and such as I have described, amount to 2 or 3 thousand. The Engineer also made it very clear that he was NOT talking about chips in PCs, or in Fax machines, or photocopiers, or in the office accounting systems, but STRICTLY in the process controllers of the plant.
My third question was, "How do you go about determining if a system is compliant or not?" The answer is that first it is determined if a system is CRITICAL or ESSENTIAL to the process. Secondly it is determined if the system needs to be Y2K COMPLIANT or just Y2K READY. (Y2K ready systems are systems that will put out a wrong date but that date is not critical to the operation of the system).
Many of the devices that we are talking about are PLCs (Programmable Logic Controllers), and I was eager to look at a PLC. So, what does one look like? Well, many of them are a metal box, about the size of a brick. In fact, at this facility at least, that is their nickname for them, "the brick". But others can be the size of a bread box and others as large as, or perhaps a bit larger than, a large home refrigerator. . . .
Now comes the interesting part, as it was explained to me. These PLCs are NEVER tested on-line. EACH and EVERY ONE of them is removed from the system, and taken off-line for testing. There are five tests performed on each one. . . . Where they could, they would interface the system with an advanced clock. They would watch the system go through the clock change from 99 to 00. Even if the system successfully performed in that function they would then turn the system off and see if they could restart it.
Much to their initial surprise (and I had previously heard this from other engineers), some of the systems would not then restart.
I was also interested to learn that they have established a number of particular HIGH CHECK dates for the systems once they are back in operation. There is of course the 99 to 00 rollover date. There is also the March 1, 2000 date, to see if it did the leap year calculation correctly. There is the January 1, 2001 date check, to see if the programs managed to keep correct count of the number of days through the year 2000. . . .
So now we got down to my MOST IMPORTANT QUESTION, "How is it you can go into a ROM and determine whether a SECONDARY clock is being used by the program?" I listened to the long answer and then I said, "Sorry, but I may be a little dense, I don't understand how you can go in and set the SECONDARY clock, so please explain to me again how you can tell if the program is using the clock, and how it is using the clock?" Missed it again. "Ummm, sorry, I know that we have been around this mulberry bush a couple of times, but I still have missed the explanation. Could we do that one more time?" Yep. "Well," says the producer, "maybe we can take a break here for a moment." So now off camera the Engineer says, "Well, as my boss says, it is like looking for a needle in a haystack; we know how big the haystack is but we don't know how many needles there are." "Yes", said I, "but do you even know HOW to find the needle?" Not wishing to be boorish, I went on to other subjects. . . .
Conservatively 2,000 critical systems [are] in a refinery, having about 25% problems, or 500 with problems. These are being corrected one by one. But let us go back to my old rule of 3% and say that applies to only the identified 500 that have what I will call the hidden clock. This then gives us still 15 needles in the haystack that won't be found.
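As a rough check on the arithmetic in this estimate, here is a minimal sketch using only the figures quoted in the text: 2,000 critical systems, a 25% problem rate, and the author's 3% rule of thumb. The variable names are made up for illustration, and the 3% is the author's own working assumption rather than a measured miss rate. The sketch also applies the same 3% to all 2,000 systems to give an upper bound.

```python
# Figures quoted in the text; whole-number percentages keep the arithmetic exact.
critical_systems = 2_000   # "conservatively 2,000 critical systems"
problem_percent = 25       # share of systems reported to have problems
hidden_percent = 3         # the author's rule-of-thumb rate for unfound cases

problem_systems = critical_systems * problem_percent // 100

# Lower estimate: the 3% rule applied only to the systems known to have problems.
needles_low = problem_systems * hidden_percent // 100

# Upper estimate: the same 3% rule applied to every critical system.
needles_high = critical_systems * hidden_percent // 100

print(problem_systems)  # 500
print(needles_low)      # 15
print(needles_high)     # 60
```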
If it were based on the original number of 2000 then there would be 60, and I guess that is where I probably feel that it lies in this situation: somewhere between 15 and 60, and possibly nearer to the 15. Even those 15, with their UNDEFINED results, may or may not have THAT detrimental an effect. In fact, I would say that will be the case in 90% of the cases. But this still leaves the probability of 1 or 2 catastrophic events in such a large installation. It just does not seem realistic that we can be 100% sure that we will get 100% of the needles that can cause a catastrophe.
It comes down to a matter of probabilities. In any given year, even now, there is a probability that a certain number of plants will have an accident of such severity that it will cause fatalities. This is a fact of life. (Consider the past year). And it is, in my mind, a very HIGH Probability that the Y2K defect will cause more problems than would otherwise occur. I consider this a very conservative statement. Whether or not there may be catastrophes on the Bhopal scale, I don't know. With the proper effort and policies I think those kinds of disasters could be avoided. Whether or not they will be depends on how well the problem is understood.
In talking with City Engineers and Water Works people, I have not felt that they comprehended the problem at all. So, when I mentioned this to our Engineer friends, and asked, "How well do you think others understand this problem?" they sort of chuckled. "Not very likely for people like that who are not in a major research facility to have a grasp of it", was their reply. "But, anyhow, they don't have a problem, nor the number of processors, anything like a petroleum refinery has", they went on to add.
Then the next morning, the following came through in my email: "Sydney Water Ready for Y2K (Merri Mack, Computerworld Australia) Sydney Water (Sydney, Australia) started working on Y2K in 1996.
Alex Walker, managing director, says all systems will be compliant by the end of June; he says that only 1.57% of the systems remain to be tested. Some 90,000 embedded chips in operational plants and telemetry had to be tested. There were 36,000 with potential problems, and "10,000 of those needed special assessment. Of these 10,000 Sydney Water found between 1000 and 1,500 needed something done." Walker said the work is virtually finished."
Wait a minute! This water works had 90,000 chips that needed testing? And 1,000 that needed something done!!!!! Whoa! That is twice as many as we were figuring above for the refinery. My grip on reality is starting to slip again. 90,000 chips for a water works! I hope not. But even if it is only 1% of that, I don't think most of the City Water Engineers in the world have the sort of grasp of the microprocessor problem that we are talking about here and that is necessary to solve it. They don't have the Engineering Research budget to begin to tackle it in the first place.
And the other thing that really bothered me about my 5 days in the 2 libraries and on the Internet devoted to finding the literature concerned with this problem: I didn't find any. "How come", I asked the Engineer. "It is not a theoretical research problem", he replied. . . .
I had lots of other candid discussion, all to the point, with these Engineers. It would take far too long to repeat it all, but here are a couple of closing quickies: "How is your relationship with your suppliers?" "We spell out to them that they HAVE to show us that they are Y2K compliant, or we will find a new supplier. And they do, because they don't like saying to their boss - 'Boss, guess what, we just lost that million dollar contract because we aren't Y2K compliant'". "Do you share your findings with your competitors?" "Heck no. This is a hardball world. If they can't figure it out we will love to have their customers". "And you are REALLY sure that you are going to be ready?"
"Absolutely!" Hmmm, I am thinking. I wonder what HIS boss would say if he said, "You know boss, I think there is a real possibility the whole place is going to come to a screeching halt with Y2K and I don't have a clue as to how to find those last needles in the haystack." But instead they say, "Oh, yep, yep, yep. We are okay. We have it figured out. Now, just so long as the electric company, the railroad, and all those other guys have it figured out we are going to be okay. Oh, there may be some good solid bumps in the road, here and there, and I sure wouldn't want to be living in a city ghetto, with or without Y2K, but yep, yep, yep, it is all going to be okay."
If I hadn't minded being gauche, I would have said one more time, "Okay, explain to me how you can be so confident, when you don't know how to find the SECONDARY clock needles in the haystack, and when you are SO MUCH MORE KNOWLEDGEABLE here at this top grade Research Facility with over 900 research employees than the average run of the mill chemical plant (which has maybe a couple of hundred employees altogether), or a medium city engineering office with maybe less than 20 engineers, and maybe none of them Ph.D.s? How can you be so confident the rest of the world is going to make it?" . . . .
Who else is figuring this out besides me? I really think that I have a better grasp of this than over 99% of the people out there. But, if you think that I am bragging, I am NOT. I am complaining that I can't find lots of people that understand this better than I. So, if you are in that other 1% and would be so kind as to point out the fallacy in my thinking, I am telling you SINCERELY that I would GREATLY appreciate it. To subscribe to my short weekly letter, write to Y2KFIND-subscribe@listbot.com
Peace and love,
Bruce Beach
survival@webpal.org