Technology Stocks : Advanced Micro Devices - Moderated (AMD)

To: Saturn V who wrote (5176), 8/15/2000 8:23:42 PM
From: pgerassi
 
Dear Saturn:

Re: IPC gains P2 vs Pentium

Out-of-order speculative execution, together with a superscalar RISC-based core, is what makes the P2 faster than the Pentium IN SPITE of its longer pipeline. Had the P2 merely had a longer pipeline, its IPC would not have gone up. For IPC to be greater than 1, each stage has to be able to handle more than one instruction at a time. Pipelining only allows maximal use of a resource if and only if that resource is the bottleneck of the pipe. Any part that is not the bottleneck will not be fully utilized. Thus a single pipeline can complete at most one instruction per advance of the pipe. That is what IPC, instructions per clock (advance), means. So getting more than one instruction per clock requires more than one pipe. Athlon has 3 decoder pipes plus 9 execution pipes. P2 had three execute pipes and two decode pipes.
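To put the bottleneck argument in concrete terms, here is a minimal sketch. It assumes a very crude model in which peak IPC is capped by whichever stage has the fewest parallel pipes; the function name `peak_ipc` is made up for illustration, and the pipe counts are simply the ones quoted in this post, taken at face value.

```python
def peak_ipc(pipes_per_stage):
    """Upper bound on sustained IPC: the stage with the fewest parallel pipes."""
    return min(pipes_per_stage.values())

# Pipe counts as quoted in the post, taken at face value (not verified here).
single_pipe = {"decode": 1, "execute": 1}
p2 = {"decode": 2, "execute": 3}
athlon = {"decode": 3, "execute": 9}

for name, pipes in [("single pipe", single_pipe), ("P2", p2), ("Athlon", athlon)]:
    print(f"{name}: peak IPC <= {peak_ipc(pipes)}")
```

This is only a ceiling. Real IPC sits below it because of stalls, dependencies and cache misses, which the model ignores.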

Now P4 does not have many more pipes than P3. In fact it has fewer FPU pipes. The trace cache and the double-clocked ALUs are really there to shorten the effective length of the very long pipe. Without them, even Intel realizes that the stalls would destroy any speedup gained from the deeper pipeline. From what we know of the architecture, and leaving aside the unknown advantages of these two features (we can only speculate, and a lot of that benefit can be crushed by poor execution), the long pipes will reduce IPC by 30% to 50%. With very ideal balancing of the stages, you could hopefully more than make up for that by increasing the clock, but this is very difficult in a general-purpose processor.
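A rough sketch of why a longer pipe costs IPC, using the usual textbook model where each mispredicted branch flushes the pipe and costs a number of cycles that grows with pipe depth. Every number below is a made-up assumption for illustration, not a measured P3 or P4 figure.

```python
def effective_ipc(ideal_ipc, mispredicts_per_instr, flush_penalty_cycles):
    """IPC after adding branch-flush stall cycles to the ideal cycles per instruction."""
    cpi = 1.0 / ideal_ipc + mispredicts_per_instr * flush_penalty_cycles
    return 1.0 / cpi

# Hypothetical inputs, chosen only to show the shape of the effect.
IDEAL_IPC = 2.0               # IPC with no stalls at all
MISPREDICTS_PER_INSTR = 0.02  # e.g. 20% branches, 10% of them mispredicted
SHORT_PIPE_PENALTY = 10       # flush cost in cycles for a shorter pipe
LONG_PIPE_PENALTY = 20        # flush cost in cycles for a pipe twice as deep

short_ipc = effective_ipc(IDEAL_IPC, MISPREDICTS_PER_INSTR, SHORT_PIPE_PENALTY)
long_ipc = effective_ipc(IDEAL_IPC, MISPREDICTS_PER_INSTR, LONG_PIPE_PENALTY)
ipc_loss = 1.0 - long_ipc / short_ipc

print(f"short pipe IPC: {short_ipc:.2f}")
print(f"long pipe IPC:  {long_ipc:.2f}")
print(f"IPC lost to the longer pipe: {ipc_loss:.0%}")
```

The loss depends heavily on the misprediction rate you plug in, which is exactly why the trace cache and branch prediction quality matter so much for a deep pipe.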

The trace cache is there for two purposes: one is to reduce the length of pipeline stalls, and the other is to reduce the amount of power used in decoding the instructions (thus allowing the decode section to be smaller (I know, more balancing)). The second depends on instruction reuse: by caching the output of the decoders, reused code costs less power, leaves room for more optimization, or simply keeps the execute pipes after this section better utilized, thus increasing the IPC. The first purpose assumes that mispredicted branches will land on already decoded and stored sections of code often enough to outweigh the longer stalls when the target is not in the trace cache (correctly predicted branches will be in the trace cache or in the process of getting there). IMHO, the first is not too likely to happen, but the second might gain 10 to 20% at most.
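The first purpose is really a bet on hit rates, which can be sketched as an expected-value calculation. This assumes a simple two-outcome model (the mispredicted target is either in the trace cache or not); the function names and all cycle counts are hypothetical.

```python
def expected_stall(hit_rate, hit_stall_cycles, miss_stall_cycles):
    """Average stall per mispredicted branch, given how often its target is cached."""
    return hit_rate * hit_stall_cycles + (1.0 - hit_rate) * miss_stall_cycles

def break_even_hit_rate(baseline_stall, hit_stall_cycles, miss_stall_cycles):
    """Hit rate at which the trace cache design merely matches a design without one."""
    return (miss_stall_cycles - baseline_stall) / (miss_stall_cycles - hit_stall_cycles)

# Hypothetical cycle counts, for illustration only.
HIT_STALL = 10       # target already decoded in the trace cache
MISS_STALL = 25      # target must be fetched and decoded again
BASELINE_STALL = 18  # same pipe with a conventional instruction cache

needed = break_even_hit_rate(BASELINE_STALL, HIT_STALL, MISS_STALL)
print(f"hit rate needed just to break even: {needed:.0%}")
print(f"stall at that hit rate: {expected_stall(needed, HIT_STALL, MISS_STALL):.1f} cycles")
```

Below the break-even hit rate the trace cache makes stalls longer on average, which is the outcome I consider more likely.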

The double-pumped ALU only affects two to three stages of the pipe, but it may act as a bigger pipe section for resource management. Here I agree with Scumbria that the speedup from skipping a stage or two will not make up for the losses caused by its being a bottleneck when the clock is pushed up.

Overall, IMHO the increased clock rate will do little more than cover the performance hit in IPC. Thus a 1.4 GHz P4 will not be any faster than a 1 GHz P3 for integer, FPU will be slower, and SSE2 may allow it to overtake the P3, but not by as much as SSE2 would as a P3 add-on. I think that the double-pumped ALU will be quietly dropped as a gee-whiz feature that is not worth the trouble (as rumors of speed bins attest).
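The arithmetic behind that comparison, as a small sketch. It assumes performance is simply IPC times clock and plugs in the 30% to 50% IPC loss estimated earlier in this post; the function names are made up for illustration.

```python
def relative_performance(ipc_ratio, clock_ratio):
    """Performance relative to the baseline chip, modeled as IPC x clock."""
    return ipc_ratio * clock_ratio

def break_even_clock_ghz(baseline_clock_ghz, ipc_loss):
    """Clock the new chip needs just to match the baseline, given its IPC loss."""
    return baseline_clock_ghz / (1.0 - ipc_loss)

P3_CLOCK_GHZ = 1.0
P4_CLOCK_GHZ = 1.4

for ipc_loss in (0.30, 0.50):  # the range estimated in the post
    perf = relative_performance(1.0 - ipc_loss, P4_CLOCK_GHZ / P3_CLOCK_GHZ)
    needed = break_even_clock_ghz(P3_CLOCK_GHZ, ipc_loss)
    print(f"IPC loss {ipc_loss:.0%}: 1.4 GHz part runs at {perf:.2f}x the 1 GHz part, "
          f"break-even clock {needed:.2f} GHz")
```

At a 30% IPC loss the 1.4 GHz part comes out at about 0.98 of the 1 GHz part, essentially a tie; at 50% it would need 2 GHz just to draw level.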

Pete