Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Charles Gryba who wrote (53266) 8/30/2001 6:14:06 PM
From: Tenchusatsu
 
Constantine, <Thread, is this guy right or wrong about the instructions per cycle for the P4?>

Unfortunately, it all depends on what your definition of "instruction" is.

Clintonian wordplay aside, remember that all modern x86 processors translate native x86 instructions into internal micro-ops. If the author says that the Pentium 4 can process up to six instructions per clock, he is obviously referring to micro-ops. I think all x86 processors made by Intel and AMD these days can decode up to three x86 instructions at a time. Beyond that, the internal micro-ops differ from one processor architecture to the next. That's enough to disqualify the number of micro-ops processed per clock as a reliable measure of performance.

I won't even go into the effect of caches on actual performance, because that should have been painfully obvious to the author.

Tenchusatsu



To: Charles Gryba who wrote (53266) 8/30/2001 6:14:59 PM
From: wanna_bmw
 
Constantine, Re: "is this guy right or wrong about the instructions per cycle for the P4 [The Pentium 4 processes up to six instructions per clock.]?"

That question will take a bit to explain.

For one thing, the Netburst micro-architecture has a complex decoder that can process one instruction per clock. Instructions usually decode into 3-4 uops, which get stored in the trace cache. The trace cache is set to send uops out to the processor back-end once every other clock cycle, and sends out up to 6 uops each time this occurs. That averages out to 3 uops per clock.
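The averaging in that last step can be sketched in a few lines. This is only the arithmetic from the paragraph above; the function name is made up for illustration, and it assumes every delivery carries the full 6 uops.

```python
from fractions import Fraction

def average_uops_per_clock(uops_per_delivery, clocks_between_deliveries):
    """Average delivery rate when a batch of uops goes out every N clocks."""
    return Fraction(uops_per_delivery, clocks_between_deliveries)

# Trace cache as described above: up to 6 uops, once every other clock.
trace_cache_rate = average_uops_per_clock(6, 2)
print(trace_cache_rate)  # 3
```

Any delivery that carries fewer than 6 uops pulls the real average below 3, so this is a ceiling rather than a typical figure.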

However, because of the large out-of-order window of the micro-architecture, uops tend to wait in queues, buffers, and registers while progressing through the long Pentium 4 pipeline. This is a good thing, since it allows a higher overall throughput and lets the core get very close to the issue rate of 3 uops per clock.

Later in the processor back end, the uops line up for dispatch to the execution engines. The Pentium 4 can dispatch up to 4 uops every clock: one to the floating point engine, one to the load/store engine, and 2 to the double-pumped ALUs. Because uops take time to execute, an execution engine is often too full to accept another uop. That's why the micro-architecture includes 2 double-pumped ALUs (equivalent to 4 ALUs), a third single-pumped ALU, 2 load/store units, and 2 floating point units, one of which only accepts floating point uops that access memory.

In short, the micro-architecture is complicated to explain, but overall you can expect peak performance to converge on the trace cache's issue rate of 3 uops per clock.
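One way to see why the peak lands on 3 rather than 4 is a simple bottleneck model: in steady state a pipeline cannot sustain more than its slowest stage. The sketch below assumes that simplified model (it ignores stalls, cache misses, and branch mispredictions), and the stage names are labels chosen for illustration.

```python
def sustained_uops_per_clock(stage_rates):
    """Steady-state ceiling of a pipeline: the rate of its slowest stage."""
    return min(stage_rates.values())

# Peak rates quoted above for the Pentium 4, in uops per clock.
pentium4_stage_rates = {
    "trace_cache_issue": 3,  # 6 uops every other clock
    "dispatch": 4,           # 1 FP + 1 load/store + 2 to the double-pumped ALUs
}

print(sustained_uops_per_clock(pentium4_stage_rates))  # 3
```

The extra dispatch bandwidth does not raise the ceiling; it only helps drain uops that have piled up behind a busy execution engine.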

Meanwhile, the Athlon issues 3 Mops (Macro-ops, as opposed to micro-ops) per cycle, and these can later be split apart by the dispatch units. The Athlon has 9 execution units, which is equivalent to the Netburst core if you count each double-pumped ALU as two virtual ALUs. Each Mop can be converted into at most 3 smaller uops once it gets to the dispatch unit, in order to feed all 9 execution units. The problem is that the Athlon has a smaller out-of-order window, and few Mops split into more than 1 or 2 uops. Together these prevent the extra width from being a significant advantage over the Netburst core, except for very favorable instruction streams (rendering is one such stream).
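The unit counts and the Mop arithmetic above can be checked with a short sketch. The numbers are the ones given in this post; the dictionary keys and function names are made up for illustration.

```python
# Netburst execution units as listed earlier in this post.
netburst_units = {
    "double_pumped_alu": 2,
    "single_pumped_alu": 1,
    "load_store": 2,
    "floating_point": 2,
}

def netburst_unit_count(units, count_double_pumped_twice=True):
    """Total units, optionally counting each double-pumped ALU as two."""
    total = sum(units.values())
    if count_double_pumped_twice:
        total += units["double_pumped_alu"]
    return total

def athlon_uops_per_cycle(mops_per_cycle, uops_per_mop):
    """uops per cycle if every Mop splits into the same number of uops."""
    return mops_per_cycle * uops_per_mop

print(netburst_unit_count(netburst_units))  # 9
print(netburst_unit_count(netburst_units, count_double_pumped_twice=False))  # 7

# The Athlon issues 3 Mops per cycle, each splitting into 1 to 3 uops.
for uops_per_mop in (1, 2, 3):
    print(uops_per_mop, athlon_uops_per_cycle(3, uops_per_mop))
```

Only the 3-uops-per-Mop case reaches 9 uops per cycle, enough to feed all 9 units; the common 1 or 2 uop cases give 3 or 6.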

I hope this answers your question. I have done extensive research on this topic, and would be happy to answer any questions you may have.

wanna_bmw



To: Charles Gryba who wrote (53266) 8/31/2001 12:39:53 AM
From: Dan3
 
Re: is this guy right or wrong about the instructions per cycle for the P4?

P4 has 6 execution units and Athlon has 9.

Think of a 6 cylinder engine and a 9 cylinder engine.