To: Tenchusatsu who wrote (107896) 8/21/2000 3:42:40 PM From: pgerassi

Dear Tench:

Re: Four-issue single-clocked ALU vs. two-issue double-clocked ALU

A four-issue single-clocked ALU can do four separate ops in one clock cycle. The dual-issue design only matches this when all the ops are half-cycle ops. If the instructions each need a whole cycle, the dual issue completes half as many operations per cycle. If they need a cycle and a half, the dual issue completes 4 instructions every three cycles while the four issue completes them in two cycles. If they need the full two cycles, the dual issue completes only two instructions every two cycles, whereas the four issue completes 4 instructions in two cycles. This shows that the double-clocked dual issue is slower than a true four issue in all but one case, all half-cycle ops, and even then it only matches it.

Furthermore, for the ALU to be fully effective, the pipe must be designed to sustain 4 micro-ops per cycle all the way from the trace cache to the ALU, and from the ALU to the retirement unit at the end of the pipe. This requirement means the only area savings is the difference between the 2x2 ALU and the x4 ALU. Even given this, there is the penalty for ALU ops longer than half a cycle, or for mixed-length ALU ops. In addition, certain ALU ops may require more than the four half-cycles allotted. This is where the cost of maintaining a double-clocked ALU comes from (it shows up either as more pipeline stages or as a lower overall clock).

It is very easy to forget that a pipeline is only as fast as its slowest stage. A three-issue stage anywhere in an otherwise four-issue pipe forces the whole pipe down to three issues overall; the three-issue stage is the bottleneck. Each instruction mix can expose a different bottleneck.
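The arithmetic above can be sketched as a small throughput model. This is a rough sketch, not anything from Intel or AMD: the port counts, the clock multiplier, the round-up-to-ALU-tick rule, and the stage names are my assumptions about how to model the comparison.

```python
import math

def throughput(ports, clock_mult, op_latency):
    """Completed ops per base clock cycle.

    ports      -- number of ALU issue ports
    clock_mult -- ALU clock relative to the base clock (2 = double-clocked)
    op_latency -- op latency in base clock cycles
    """
    # A port can only start and finish ops on its own clock ticks, so
    # round the latency up to a whole number of ALU ticks.
    alu_ticks = math.ceil(op_latency * clock_mult)
    return ports * clock_mult / alu_ticks

for latency in (0.5, 1.0, 1.5, 2.0):
    four_wide = throughput(4, 1, latency)  # four-issue, single-clocked
    dual_fast = throughput(2, 2, latency)  # dual-issue, double-clocked
    print(f"{latency}-cycle ops: 4x1 = {four_wide:.2f}/cy, 2x2 = {dual_fast:.2f}/cy")

# Slowest-stage rule: sustained issue width is the minimum across
# stages (stage names and widths here are purely illustrative).
stages = {"trace_cache": 4, "address_gen": 3, "alu": 4, "retire": 4}
print("effective issue width:", min(stages.values()))
```

The model reproduces the cases above: the 2x2 design ties the 4x1 design only for half-cycle ops (4 per cycle each) and trails it at every longer latency.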
For example, an area of code with a lot of pointer arithmetic may force a lot of activity in the address-generation portion of the pipe while the ALU waits for the data it is to process. The ALU is not the bottleneck; rather, the shortage of address-generation units to fetch the operands the ALU will operate on causes short bubbles in the pipe while the accesses occur. Even if the ALU were fifty issues wide, it could not process the nonexistent data any quicker if the address unit can only generate one data operand per cycle.

This shows that the pipeline must be balanced for the average load being worked on, which is not easy to do. As a matter of fact, the balance shifts with time. It may be right for the software in use when the chip was designed, but if the requirements change by the time it is actually in use, the pipe becomes unbalanced. That is usually what happens when a pipe is too highly optimized. The Athlon uses less optimization and more brute force, so its balance is favorable for a wide set of programs. The PIII seems more optimized but has a narrower balance point. Thus the PIII does well on highly optimized benchmarks but is slower than the Athlon on non-optimal programs. The Athlon does better at serving, multitasking, and scientific number crunching, applications that are rarely predictable and thus hard to optimize for. "One off" programs are rarely optimized, and that is typical of research and development uses of number crunching.

Without further testing in the real world, it is very difficult to know what the impact of all these new ideas in the P4 will be. From personal experience, designing to the performance edge typically causes many workloads to go over the cliff and perform much worse than the "simulation" would show (one must also guard against falling into the trap of designing for the simulation datasets only).

Pete