Re: ...the ALU, because it runs off the same clock as the rest of the design. It simply completes its operation in one phase of the clock.
If 4 of the stages involve pulling data from the cache, then it takes 16 double-speed stages = 8 clock cycles to complete an integer operation (realworldtech.com). Athlon has a 10 clock cycle integer pipeline. If it can pull data from the cache in 2 clock cycles, Athlon will have the same 8 clock cycle pipeline. The "20 stage pipeline" is either a scam, or, if there really is a lot of logic in there operating on half cycles, it may have the performance of an 8 stage pipeline with the scalability of a 9 stage pipeline. In terms of manufacturing ease and MHz scalability, Willamette's pipeline may be equal to or shorter than (and so, inferior to) Athlon's! It is almost certainly not twice as long.
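To make the arithmetic above concrete, here is a rough sketch. It assumes, as in the post, that Willamette's 20 stages are half-clock stages with 4 of them spent on cache access, and that Athlon spends 2 of its 10 full-clock stages on cache access. The function and parameter names are made up for illustration; none of these figures are confirmed by Intel or AMD.

```python
def full_clock_cycles(total_stages, cache_stages, stages_per_clock):
    """Full clock cycles spent in the non-cache stages of a pipeline.

    stages_per_clock is 2 for a double-pumped design (assumed), 1 otherwise.
    """
    logic_stages = total_stages - cache_stages
    return logic_stages / stages_per_clock


# Figures taken from the post; all of them are assumptions, not vendor data.
willamette = full_clock_cycles(total_stages=20, cache_stages=4, stages_per_clock=2)
athlon = full_clock_cycles(total_stages=10, cache_stages=2, stages_per_clock=1)

print("Willamette:", willamette, "clocks")
print("Athlon:", athlon, "clocks")
```

Under those assumptions both pipelines come out to the same number of full clock cycles.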
Now, I'm not sure I like that, because I am impressed by per clock performance, but perhaps you should reconsider your opinion of Willamette. What if AMD determines that in some cycles up to 3 transistors can change state and they announce that Athlon actually uses a triple pumped 30 stage pipeline? Would that make it any faster? The whole point of a deep pipeline is that it spreads out the work of completing an instruction over more clock cycles, making it easier to design a processor that can run at high clock speed. Cutting the duration of each clock in half, then doubling the number of these half length clocks allocated to complete an instruction leaves the same amount of time available as half as many full length clocks. It's not going to permit any higher speed for a given quality of process. I keep bringing up this point, but you keep focusing on other aspects.
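The time-budget point can be checked with a few lines. This is only a sketch: the 1000 MHz clock is an arbitrary illustrative figure, and the helper name is hypothetical. It treats each stage of an N-pumped pipeline as lasting 1/N of a clock period.

```python
from fractions import Fraction


def time_per_instruction_ns(stages, clock_mhz, pumps):
    """Total time in ns for one instruction to pass through the pipeline.

    Each stage is assumed to last 1/pumps of one clock period.
    """
    period_ns = Fraction(1000, clock_mhz)  # one clock period in nanoseconds
    return stages * period_ns / pumps


# Same clock, same total logic time, three ways of counting stages.
single = time_per_instruction_ns(stages=10, clock_mhz=1000, pumps=1)
double = time_per_instruction_ns(stages=20, clock_mhz=1000, pumps=2)
triple = time_per_instruction_ns(stages=30, clock_mhz=1000, pumps=3)

print(single, double, triple)
```

All three come out to the same total, which is the point: relabeling half or third cycles as stages does not give the logic any more time to settle.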
Sorry to go off on a Dennis Miller style rant, and maybe I'm missing something very big, but I think that until Intel releases more details on individual instruction latencies, we should be careful about drawing conclusions regarding the relevant length of the double-pumped pipeline.
Regards,
Dan