To: wanna_bmw who wrote (50852 ) 8/13/2001 12:37:05 AM From: pgerassi Respond to of 275872

Wanna_bmw: You forgot to include the limitations of adding more pipeline stages.

First, there is a delay associated with the register needed between stages. As the amount of work done in each stage decreases, this delay becomes a larger fraction of each stage's latency and eventually very significant. For example, if the delay for temporary storage between stages (a sort of sample-and-hold circuit) is 100 ps (ps = one trillionth of a second) and the operating frequency is 1 GHz, each stage has 900 ps to do work and absorb jitter, plus 100 ps for the interstage register. If you halve the work done in each stage, 450 ps of work plus 100 ps of interstage delay nets 550 ps per stage, or about 1.8 GHz, not the 2 GHz one would assume. In the first case the interstage register takes 10% of the cycle; in the second, 18%.

The second problem is pipeline stage imbalance. The cycle time is the longest any stage takes to complete its work, plus an allowance for jitter and the interstage delay. So if 10 stages finish in 300 ps, 9 stages in 400 ps, and one takes 550 ps, the clock must allow 550 ps for every stage, and you have a 1.8 GHz pipeline even though the average stage latency of about 358 ps would correspond to roughly 2.8 GHz. A balanced pipeline can make more out of its stages than an unbalanced one, even if it is slightly slower in average latency. Many designers have stated that the Athlon pipelines are well balanced. Also, the larger the number of stages, the harder it is to keep any sort of balance in latency.

Next is the matter of pipeline stalls, or bubbles. These happen when a stage is about to do its work but something it needs is not yet ready. The pipeline stalls from that point back to the front, creating a bubble of no work done until the requirement is satisfied. The stages after the stalled one keep doing work on whatever is already in the pipeline.
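The two clock-speed arguments above work out like this (a rough sketch; the 100 ps register delay and the stage timings are just the illustrative figures from the post, not real silicon numbers):

```python
# Achievable clock is set by the slowest stage plus the fixed
# interstage-register overhead. All timings are in picoseconds.

REGISTER_DELAY_PS = 100  # sample-and-hold cost between stages (illustrative)

def max_clock_ghz(stage_work_ps, register_ps=REGISTER_DELAY_PS):
    """Clock period = worst-case stage work + interstage register delay."""
    period_ps = max(stage_work_ps) + register_ps
    return 1000.0 / period_ps  # 1000 ps per ns -> GHz

# Example 1: halving per-stage work does not double the clock.
print(max_clock_ghz([900]))   # 1.0 GHz  (900 + 100 = 1000 ps period)
print(max_clock_ghz([450]))   # ~1.82 GHz (450 + 100 = 550 ps), not 2 GHz

# Example 2: one slow stage throttles the whole pipeline.
stages = [300] * 10 + [400] * 9 + [550]   # the 550 ps stage sets the clock
print(max_clock_ghz(stages, 0))           # ~1.82 GHz
avg = sum(stages) / len(stages)           # 357.5 ps average stage latency
print(1000.0 / avg)                       # ~2.8 GHz if perfectly balanced
```

The same code shows why shrinking stages has diminishing returns: the fixed register delay grows as a share of the cycle no matter how finely the work is split.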
The longer the pipeline, the more likely this is to occur. The really bad bubbles come from a mispredicted branch: these can stall 18 stages of a 20-stage pipeline, creating a very big bubble. Bubbles reduce the rate of work that any given pipeline does per cycle. You can tell this is very significant from the amount of effort CPU designers put into mitigating it.

Lastly, there is the problem of imbalance in the width of a pipeline. If the instructions on what to do next are found in the trace cache of a P4, up to three operations can be started at once. But if a branch mispredict occurs and the next instruction has not been decoded into the trace cache, only one instruction is decoded per stage and about 8 additional stages are added before anything gets done. This shows that the P4 is heavily weighted toward the back end. With a lot of mispredicted branches, or a working code set larger than the trace cache, its IPC drops below 1, sometimes by quite a bit. The Athlon can decode as fast as it can execute, and it can do up to 6 operations at once (with some limitations) per stage. This is also referred to as an imbalance in the pipeline as a whole; it is easy to confuse the two kinds of imbalance except by context.

It is likely that some of the speed difference between the P3 and the Athlon is due to the Athlon's well balanced design, with the rest due to process. Between the P4 and the Athlon, more of the difference is due to the P4's very long pipeline, but the limitations above keep it from clocking 2.8 times faster, and it performs much worse in both IPC and overall performance.

Pete
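P.S. Here is roughly how the mispredict bubbles eat into IPC (a back-of-the-envelope model; the base IPC of 3, the 1-per-100 mispredict rate, and the penalty figures are assumed for illustration, not measured P4 or Athlon data):

```python
# Amortize the cost of branch-mispredict bubbles over all instructions.

def effective_ipc(base_ipc, mispredict_rate, penalty_cycles):
    """Average instructions per cycle once mispredict bubbles are charged.

    mispredict_rate: mispredicted branches per instruction executed.
    penalty_cycles:  pipeline stages that must refill after a mispredict.
    """
    # Cycles per instruction = base CPI plus the amortized bubble cost.
    cpi = 1.0 / base_ipc + mispredict_rate * penalty_cycles
    return 1.0 / cpi

# A 20-stage pipeline paying an 18-cycle bubble per mispredict, with one
# mispredicted branch every 100 instructions:
print(effective_ipc(3.0, 0.01, 18))   # ~1.9, down from a peak of 3
# The same mispredict rate hurts a shorter pipeline far less (10-cycle bubble):
print(effective_ipc(3.0, 0.01, 10))   # ~2.3
```

The longer the pipeline, the bigger the penalty per mispredict, so the same branch predictor costs a 20-stage design much more IPC than a 10-stage one.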