Mani, re:<I still do not understand why there is more "branch miss-prediction penalty" if the CPU is more deeply pipelined.>
I'll explain this with two hypothetical processors -- the first one, PIPE4, breaks each instruction into 4 pieces (A, B, C, and D), and the second one, PIPE2, breaks it into 2 pieces (A and B).
Here's PIPE2 running a three-instruction loop that repeats twice and then falls through the third time. In other words, the correct instruction sequence is 12312312345... We'll assume the branch prediction logic always predicts that instruction 1 follows instruction 3, so the third time we execute Ins3 there will be a branch misprediction, because Ins4 is really the next instruction. There are two execution units in this computer, an A unit and a B unit. "A" can do the first part of any instruction, while "B" can do the second part of any instruction that was started by the "A" unit on the previous clock cycle.
PIPE2:
Cycle 1: start the A phase of instruction #1; for short, we'll say 1A
Cycle 2: in the "B" unit, do the second half of instruction 1; in the "A" unit, do the first half of instruction 2. For short, we say 1B and 2A
Cycle 3: 2B and 3A
Cycle 4: 3B and 1A
Cycle 5: 1B and 2A
Cycle 6: 2B and 3A
Cycle 7: 3B and 1A (starting the 3rd time through the loop)
Cycle 8: 1B and 2A
Cycle 9: 2B and 3A
Cycle 10: 3B and 1A
(It shouldn't have done 1A, but until 3B is done, we probably don't know that!)
Cycle 11: throw away the speculative 1A, recover the correct state, and do 4A
Cycle 12: 4B and 5A
Cycle 13: 5B and 6A (instruction 5 completed and 6 started)
etc.
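If you want to mechanize that bookkeeping, here is a toy Python sketch (my own, not a model of any real machine) that prints which pipe stage holds which instruction on each cycle. The fetch stream "123123123123" is what a predict-the-loop-forever front end would deliver, so cycle 10 shows the bogus 1A automatically:

    # stages[0] is the "A" unit, stages[1] the "B" unit, and so on.
    # Assumes back-to-back issue and no squash logic -- it only reproduces
    # the cycle-by-cycle occupancy of the trace above.
    def print_trace(depth, fetch_stream, n_cycles):
        stages = [None] * depth
        feed = iter(fetch_stream)
        for cycle in range(1, n_cycles + 1):
            stages = [next(feed, None)] + stages[:-1]   # everything moves down one stage
            busy = [ins + chr(ord('A') + s) for s, ins in enumerate(stages) if ins]
            print("Cycle %d: %s" % (cycle, " and ".join(busy)))

    print_trace(2, "123123123123", 10)   # matches PIPE2 cycles 1-10 above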
We only lost one cycle because of the very short 2-cycle pipeline, but the PIPE4 machine does this:

Cycle 1: 1A
Cycle 2: 2A and 1B
Cycle 3: 3A and 2B and 1C
Cycle 4: 1A and 3B and 2C and 1D (we just completed the first instruction and have already started the second execution of the first instruction!)
Cycle 5: 2A and 1B and 3C and 2D
Cycle 6: 3A and 2B and 1C and 3D
Cycle 7: 1A and 3B and 2C and 1D
Cycle 8: 2A and 1B and 3C and 2D
Cycle 9: 3A and 2B and 1C and 3D
(The "A" unit of PIPE4 should execute the "A" chunk of instruction 4 next, but until the 3rd "3D" is completed, it won't even know it's made a mistake.)
Cycle 10: 1A and 3B and 2C and 1D
Cycle 11: 2A and 1B and 3C and 2D
Cycle 12: 3A and 2B and 1C and 3D
Cycle 13: throw away the speculative work from cycles 10-12 (the wrongly fetched 4th pass through 1, 2, and 3), recover the correct state, and execute 4A
Cycle 14: 5A and 4B
Cycle 15: 6A and 5B and 4C
Cycle 16: 7A and 6B and 5C and 4D
Cycle 17: 8A and 7B and 6C and 5D...
PIPE4 took 17 cycles to complete the instruction sequence 12312312345, versus 13 cycles for PIPE2. But PIPE4 could really have been completing other instructions during its first 3 cycles, so the real additional penalty for the mispredicted branch was 2 extra clock cycles (a 3-cycle mispredict penalty for PIPE4 versus a 1-cycle penalty for PIPE2).
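To make that arithmetic concrete, here's a minimal cycle-count model (my own sketch, assuming a single-issue in-order pipe where fetch is redirected the cycle after the branch leaves the last stage):

    # Completion cycle of the last instruction in the sequence 12312312345,
    # where the 9th instruction (the 3rd "3") is the mispredicted branch.
    def total_cycles(depth, n_ins=11, branch=9):
        # instructions 1..branch issue back to back, one per cycle; the
        # branch finishes at cycle branch + depth - 1, so the first
        # correct-path instruction can't issue until cycle branch + depth
        last_issue = branch + depth + (n_ins - branch - 1)
        return last_issue + depth - 1

    print(total_cycles(2))   # 13, matching the PIPE2 trace
    print(total_cycles(4))   # 17, matching the PIPE4 trace

In this model the mispredict costs depth - 1 wasted cycles, which is exactly where PIPE4's 2 extra cycles come from.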
<At what point it is no longer beneficial to have a deeper pipeline? Probably too complicated for a simple answer and depends on many factors.>
If mispredicted branches occur often enough, the branch prediction penalties will overwhelm the additional clock speed that may be gained by breaking each instruction into smaller "chunks." But other factors are probably more important. Look at cycle 13 of the PIPE4 trace above: "throw away the speculative work, recover the correct state, and execute 4A." Easily said, not so easily done. If a pipeline is 20 deep, there must be hardware to save the state of every execution unit for the last 20 clock cycles. In general, saving state in a 20-deep-pipeline machine is 4 times as difficult (READ: 4x as much silicon) as in a 10-deep pipeline: there are twice as many stages to checkpoint, over a window of in-flight instructions that is twice as long.

Also, breaking an instruction into tinier and tinier parts eventually reaches a point of diminishing returns. It takes more hardware (silicon) to do an instruction in 20 itsy-bitsy steps than it does in 10. That's not just because of the state saving alluded to earlier; it's also because data has to be passed from one stage of a pipeline to the next. At some point, probably between 10 and 20 stages, the extra silicon required to do the pipelining actually slows the CPU down more than the clock speed can be increased: path lengths become longer, and clock skew becomes more of a problem. Of course, with each generation in process technology, the additional silicon becomes less significant, so there is a good reason that most 0.18u processors have a deeper pipeline than 0.35u processors.
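You can see the diminishing returns with a back-of-the-envelope model; all the constants below are invented for illustration, not measurements from any real CPU:

    # Average time per instruction vs. pipeline depth. T is the total logic
    # delay per instruction, O the per-stage latch/skew overhead, m the
    # fraction of instructions that are mispredicted branches -- all made up.
    def ns_per_instruction(depth, T=10.0, O=0.3, m=0.12):
        clock_period = T / depth + O        # deeper pipe -> faster clock, plus overhead
        avg_cycles = 1 + m * (depth - 1)    # each mispredict flushes depth-1 cycles
        return clock_period * avg_cycles

    best = min(range(1, 41), key=ns_per_instruction)
    print(best)   # lands in the mid-teens with these made-up numbers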
Finally, other things besides branch misprediction can cause the pipeline to "empty," and this is another reason that a longer pipeline is not necessarily better. Primarily, data dependency is the culprit. It's unavoidable in computer programs that at least some instructions must be totally completed before the next instruction can be started. For example, suppose the three instructions in the loop example above were:

1: subtract 1 from a register
2: multiply the register by 2
3: branch back to 1 if the result is positive
In that case, instruction 2 can't start until instruction 1 completes: we can't do 2A until 1B is finished, so 2A and 1B can never run simultaneously. In the PIPE4 computer above, we had parts of instructions 1, 2, and 3 executing simultaneously on every clock cycle, which would be impossible if each instruction depended on the previous one being complete.
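In the same toy model as before, a fully dependent stream turns a depth-D pipeline into a machine that retires one instruction every D cycles:

    # If every instruction needs the previous one's final result, each must
    # wait for the one before it to drain the whole pipe.
    def cycles(depth, n_ins, fully_dependent):
        if fully_dependent:
            return n_ins * depth        # one instruction per `depth` cycles
        return n_ins + depth - 1        # fully overlapped, back-to-back issue

    print(cycles(4, 9, False))   # 12 cycles for 9 independent instructions
    print(cycles(4, 9, True))    # 36 cycles when each depends on the last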
These are definitely oversimplified examples. Branch prediction is less important if there are enough execution units to speculatively execute instructions. But then again, the tradeoff there is the extra silicon and path delays.
(At one point in my career, I microcoded a computer that had long pipelines. It was usually possible to get near 100% efficiency if you were doing scientific calculations on long sequences of numbers, but anything else was impractical. My favorite algorithm was the "Hough Transform," which finds straight lines in images. It took about 80 instructions, each controlling dozens of pieces of hardware, to "set up" the calculation, but the inner loop of this complex scientific calculation was exactly 3 clock cycles long.)
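(For the curious, the textbook form of that calculation looks like the sketch below -- this is the standard Hough accumulation, not my old microcode. Notice that the inner loop is a long, perfectly regular stream of multiply/add/increment steps, which is exactly the kind of work that keeps a deep pipeline full.)

    import math

    # Standard line-finding Hough transform: every edge pixel votes for all
    # (rho, theta) lines through it; peaks in acc correspond to lines.
    def hough(edge_points, n_theta=180, rho_max=512):
        acc = [[0] * n_theta for _ in range(2 * rho_max)]
        for x, y in edge_points:
            for t in range(n_theta):                  # the tight inner loop
                theta = math.pi * t / n_theta
                rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
                acc[rho + rho_max][t] += 1            # one vote per (rho, theta)
        return acc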
Petz