Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (107932) 8/21/2000 7:42:24 PM
From: EricRR
 
Don't forget that part of the penalty is reduced thanks to the Execution Trace Cache. Instead of having to decode instructions again from the mispredicted branch onward, the pipeline can immediately be filled with predecoded instructions in program order.

Also remember that Pentium III's pipeline was 12-14 stages, not 10 stages as you are assuming in your "x2 pipeline" phrase.


Thanks for the correction on the P3. I thought that the 20-stage pipeline for Willy assumes a trace hit. Without that, isn't it 26 stages?



To: Tenchusatsu who wrote (107932) 8/21/2000 8:10:28 PM
From: kapkan4u
 
<Don't forget that part of the penalty is reduced thanks to the Execution Trace Cache. Instead of having to decode instructions again from the mispredicted branch onward, the pipeline can immediately be filled with predecoded instructions in program order.>

Don't forget that part of the penalty is increased when the branch target is outside the Execution Trace Cache. Not only is the pipeline flushed, but the execution trace also has to be decoded into the Trace Cache.

I think it was you who said that the P4 retained the clunky PIII decoders, running at half the speed of the core (700MHz for the 1.4GHz part).

Kap



To: Tenchusatsu who wrote (107932) 8/22/2000 3:51:24 PM
From: pgerassi
 
Dear Tench:

Re: Branch Misprediction Penalties

I think all of us forget one nontrivial fact in our calculations. The whole pipeline is not flushed when a mispredict happens, just the section between the instruction fetch (or trace fetch, in P4) and the point in the pipe where the misprediction is detected. The misprediction is not detected when the instruction is retired, which is the last section of an out-of-order pipeline. (That applies to all the CPUs we are discussing except Itanium, which has no OOE and takes a real performance disadvantage because of it.) It is detected when the compare is checked, typically two to four stages before the end. Thus the flush is 10 stages in P3, 8 stages in Athlon, and in P4, 16 stages when the target is in trace and 24 when it is out of trace.

Most good predictions go to the trace cache because the code was there before, but for a mispredict the chance of landing in trace is around 50%, so the average mispredict is 20 stages for P4. Thus, Scumbria is not wrong about the doubling of the mispredict penalty in P4 in comparison to P3. Against Athlon, it is two and a half times. Also, the small L1 caches will cause relatively more misses to L2 than on either P3 or Athlon. All in all, a 70% reduction of performance is not all that far off for a SCG.

Pete
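
For anyone who wants to check Pete's arithmetic, here is a minimal sketch of the weighted average he describes. The stage counts and the 50% in-trace figure are the estimates from his post, not measured values, and the function and constant names are made up for illustration.

```python
# Expected branch-mispredict flush depth, using the estimates from the post above.
# All numbers are the poster's assumptions, not vendor-documented figures.

def expected_flush_stages(in_trace, out_of_trace, in_trace_rate):
    """Average stages flushed per mispredict, weighted by how often
    the corrected branch target is already in the trace cache."""
    return in_trace_rate * in_trace + (1.0 - in_trace_rate) * out_of_trace

P3_FLUSH = 10      # stages flushed on P3 (post's estimate)
ATHLON_FLUSH = 8   # stages flushed on Athlon (post's estimate)

# P4: 16 stages when the target is in trace, 24 when it is not, about 50/50.
p4_flush = expected_flush_stages(16, 24, 0.5)

print("P4 average flush:", p4_flush)               # 20.0
print("P4 vs P3:", p4_flush / P3_FLUSH)            # 2.0
print("P4 vs Athlon:", p4_flush / ATHLON_FLUSH)    # 2.5
```

Changing the in-trace rate shows how sensitive the ratio is to that 50% guess: at a rate of 1.0 the P4 figure drops to 16 stages, at 0.0 it rises to 24.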