Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Saturn V who wrote (5185)    8/16/2000 11:27:08 AM
From: pgerassi
 
Dear Saturn:

Re: Branch prediction vs length of pipe

The hit due to branch mispredicts equals the mispredict ratio (mispredicts per conditional branch) times the number of cycles flushed per misprediction times the IPC given no mispredictions, divided by the number of instructions between conditional branches. In typical code that last number is between 10 and 30. Since there is one integer pipe, one x87 FPU pipe, and one SSE (SIMD FPU) pipe, let us assume that the target IPC is 3 (same as for P3 and Athlon).
Thus with a branch mispredict rate of 3% per conditional, a 24 cycle average flush per misprediction (the prediction would typically favor what's in the trace cache, so (28 + 20)/2 = 24), and 30 instructions between conditional branches, we get a hit of 0.03 x 24 x 3 / 30 = 7.2%. Now that's the lower bound. The upper bound, from 5%, 3, 24, and 10, is 36%.
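
For anyone who wants to check the arithmetic, here is a minimal sketch of the formula in Python. The function and parameter names are just illustrative labels, and the inputs are the assumed values from the paragraph above, not measurements.

```python
def mispredict_hit(mispredict_rate, flush_cycles, ideal_ipc, instrs_per_branch):
    """Fractional hit from branch mispredicts, per the formula above.

    hit = mispredict_rate * flush_cycles * ideal_ipc / instrs_per_branch
    """
    return mispredict_rate * flush_cycles * ideal_ipc / instrs_per_branch


# Assumed average flush: midpoint of the 28 and 20 cycle cases.
avg_flush = (28 + 20) / 2

# Lower bound: 3% mispredicts, 30 instructions between conditional branches.
lower = mispredict_hit(0.03, avg_flush, 3, 30)
# Upper bound: 5% mispredicts, 10 instructions between conditional branches.
upper = mispredict_hit(0.05, avg_flush, 3, 10)

print(f"lower bound hit: {lower:.1%}")
print(f"upper bound hit: {upper:.1%}")
```

Changing the target IPC or the flush length in the calls shows how quickly the hit scales with either one.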

As you can see from these raw calculations, a correct prediction rate of between 95% and 97% still leads to a large hit to performance, even though those are very good prediction values. Converting the hit to lost performance assumes that actual performance is the inverse of one plus the hit, so the fraction lost is hit/(1 + hit). Thus the percentage lost to misprediction is between 6.7% and 26.5%. If the pipe were the length of the P3's, a loss of 10 stages per mispredict (the retire stages are after the conditional test), the percentage would be 3% to 13%. Given that the overall hit on the P3 is 36%, this shows that branch mispredicts account for somewhere between 10% and 30% of the overall loss in a P3, which makes them a significant factor.
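
The conversion from hit to percentage lost can be sketched the same way, comparing the 24 cycle flush with a 10 cycle flush for a P3-length pipe. Again, the names are only illustrative, and the 3% and 5% rates, the IPC of 3, and the 10 to 30 instruction spacing are the assumptions used in this post.

```python
def mispredict_hit(mispredict_rate, flush_cycles, ideal_ipc, instrs_per_branch):
    """Fractional hit: rate * flush cycles * ideal IPC / instructions per branch."""
    return mispredict_rate * flush_cycles * ideal_ipc / instrs_per_branch


def fraction_lost(hit):
    """Performance is taken as 1 / (1 + hit), so the loss is hit / (1 + hit)."""
    return hit / (1.0 + hit)


# Flush lengths compared in the text: 24 cycles vs. 10 for a P3-length pipe.
results = {}
for label, flush in (("24 cycle flush", 24), ("10 cycle flush", 10)):
    low = fraction_lost(mispredict_hit(0.03, flush, 3, 30))   # best case
    high = fraction_lost(mispredict_hit(0.05, flush, 3, 10))  # worst case
    results[label] = (low, high)
    print(f"{label}: {low:.1%} to {high:.1%} lost")
```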

The length of the pipe also matters any time the task switches. The effect is magnified by the fact that the trace cache probably will not hold the new task's instructions and thus needs to be refilled. The performance hit in a heavy interrupt environment can be very high (real time operations are typically the heaviest task switching environments, with large multiuser systems coming in second).

All in all, I can see why this could require many iterations to get the right balance. My hat's off to the Intel engineering team if they can overcome this in typical desktop, workstation, and gaming loads. They do have their work cut out for them. Whatever results will probably be the last gasp in the longer pipeline, single core, general purpose CPU realm. The CMP seems to hearken back to the KISS principle (Keep it simple & smart (I am trying to be nice)).

Pete