Pravin, one thing I forgot to say about long pipelines. Although the total throughput of instructions would be higher thanks to the higher clock speed, the latency of each individual instruction is longer. As you can probably guess, starting from an empty pipe, it takes longer to retire the first instruction (i.e. send it all the way to the end). That means events that clear the pipe, such as branch mispredicts, exceptions, etc., would cause a bigger penalty, because the pipe has to be refilled from the start each time.
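To put rough numbers on the throughput vs. latency trade-off, here is a minimal sketch of an idealized pipeline that retires one instruction per cycle once it is full. The stage counts and clock speeds are made-up illustrative values, not actual figures for Willamette, P6, or Athlon, and the function name is my own.

```python
def time_to_retire_ns(n_instructions, stages, clock_mhz):
    """Time to retire n instructions from an empty, idealized pipeline.

    Assumes one instruction enters per cycle and nothing ever stalls.
    """
    cycle_ns = 1000.0 / clock_mhz
    # The first instruction needs `stages` cycles to reach the end;
    # each later instruction retires one cycle after the previous one.
    cycles = stages + (n_instructions - 1)
    return cycles * cycle_ns


# Hypothetical designs: a short pipe at a lower clock, a long pipe at a higher clock.
SHORT_PIPE = {"stages": 10, "clock_mhz": 1000}
LONG_PIPE = {"stages": 20, "clock_mhz": 1500}

for name, cfg in (("short", SHORT_PIPE), ("long", LONG_PIPE)):
    first = time_to_retire_ns(1, **cfg)
    thousand = time_to_retire_ns(1000, **cfg)
    print(f"{name} pipe: first instruction {first:.2f} ns, "
          f"1000 instructions {thousand:.2f} ns")
```

With these made-up numbers the long pipe finishes a long run of instructions sooner, but takes longer to deliver the first one, and that refill time is what you pay again after every flush.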
Intel's Willamette has features that deal with such long latencies:
- The branch prediction unit is improved. Although I don't know the details, I'll bet it's better than Athlon and certainly better than P6. Therefore, branch mispredicts will occur less often.
- The ALU is "double-pumped," i.e. running at 2x the frequency. According to the Willamette Developer's Guide, this "reduces latency and increases the performance for certain integer instructions."
- The execution trace cache stores already-decoded micro-ops, so presumably a flushed pipeline can be refilled without decoding the instructions again, which reduces the penalty.
So it seems Intel went all out to build a processor around a very long pipeline. As I pointed out before, it does seem like Intel is betting everything on clock speed.
Tenchusatsu
P.S. - There are also "hazards" in a program's code which cause "pipeline bubbles," i.e. holes in the stream of instructions. In a longer pipeline, those bubbles will be larger, once again impacting real-world performance.
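Here is a rough sketch of how bubbles can eat into the clock speed advantage. It assumes a simple model where some fraction of instructions each cause a bubble of a fixed number of cycles, and that the bubble is twice as large in the longer pipe. All the numbers and names are hypothetical, chosen only to show the shape of the effect.

```python
def effective_cpi(bubble_rate, bubble_cycles, base_cpi=1.0):
    """Average cycles per instruction when a fraction of instructions stall."""
    return base_cpi + bubble_rate * bubble_cycles


def ns_per_instruction(bubble_rate, bubble_cycles, clock_mhz):
    """Average time per instruction, combining stall cycles and clock speed."""
    cycle_ns = 1000.0 / clock_mhz
    return effective_cpi(bubble_rate, bubble_cycles) * cycle_ns


# Hypothetical designs: bubbles cost 4 cycles in the short pipe, 8 in the long one.
for rate in (0.1, 0.5):
    short = ns_per_instruction(rate, bubble_cycles=4, clock_mhz=1000)
    long_ = ns_per_instruction(rate, bubble_cycles=8, clock_mhz=1500)
    print(f"bubble rate {rate}: short pipe {short:.2f} ns/instr, "
          f"long pipe {long_:.2f} ns/instr")
```

Under these assumptions the long pipe still wins when bubbles are rare, but falls behind the short pipe once they become frequent enough, despite the 50% higher clock.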