To: Charles Gryba who wrote (53266) 8/30/2001 6:14:59 PM
From: wanna_bmw

Constantine,

Re: "is this guy right or wrong about the instructions per cycle for the P4 [The Pentium 4 processes up to six instructions per clock.]?"

That question will take a bit to explain. For one thing, the Netburst micro-architecture has a complex decoder that can process one instruction per clock. Instructions usually decode into 3-4 uops, which get stored in the trace cache. The trace cache sends uops out to the processor back end once every other clock cycle, up to 6 uops each time. That averages out to a peak of 3 uops per clock.

However, because of the large out-of-order window of the micro-architecture, uops tend to wait in queues, buffers, and registers while progressing through the long Pentium 4 pipeline. This is a good thing, since the buffering allows a higher overall throughput, and sustained throughput can get very close to the 3 uops per clock issue rate.

Later in the processor back end, the uops line up for dispatch to the execution engines. The Pentium 4 can dispatch up to 4 uops every clock: one to the floating point engine, one to the load/store engine, and 2 to the double-pumped ALUs. Since executing uops takes time, an execution engine is often too full to accept another uop. That's why the micro-architecture includes 2 double-pumped ALUs (equivalent to 4 ALUs), a third single-pumped ALU, 2 load/store units, and 2 floating point units, one of which only accepts floating point uops that access memory.

In short, the micro-architecture is complicated to explain, but overall, you can expect peak performance to converge to the 3 uops per clock issue rate of the trace cache.

Meanwhile, the Athlon issues 3 Mops (Macro-ops, as opposed to micro-ops) per cycle, and these can later be split apart by the dispatch units.
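
As a rough sketch of the arithmetic above, the peak rates can be worked out in a few lines of Python. The per-clock figures are simply the ones quoted in this post, and every name in the snippet (`peak_rate`, `sustained_peak`, the constants) is an illustrative assumption, not something taken from Intel or AMD documentation. It only models peak rates, not real stalls or queue behavior.

```python
from fractions import Fraction

# Figures as quoted in the post above (assumed here, not measured).
P4_TRACE_CACHE_UOPS_PER_DELIVERY = 6   # up to 6 uops each time the trace cache sends
P4_CLOCKS_PER_DELIVERY = 2             # one delivery every other clock
P4_DISPATCH_PER_CLOCK = {              # the post's breakdown of the 4 uops per clock
    "floating point engine": 1,
    "load/store engine": 1,
    "double-pumped ALUs": 2,
}
ATHLON_MOPS_PER_CLOCK = 3              # Macro-ops issued per cycle


def peak_rate(items_per_delivery, clocks_per_delivery):
    """Average items per clock when a batch is delivered every N clocks."""
    return Fraction(items_per_delivery, clocks_per_delivery)


def sustained_peak(*stage_rates):
    """A pipeline cannot sustain more than its slowest stage allows."""
    return min(stage_rates)


p4_issue = peak_rate(P4_TRACE_CACHE_UOPS_PER_DELIVERY, P4_CLOCKS_PER_DELIVERY)
p4_dispatch = sum(P4_DISPATCH_PER_CLOCK.values())
p4_sustained = sustained_peak(p4_issue, p4_dispatch)

print(f"P4 trace cache issue rate: {p4_issue} uops/clock")
print(f"P4 dispatch rate:          {p4_dispatch} uops/clock")
print(f"P4 sustained peak:         {p4_sustained} uops/clock")
print(f"Athlon issue rate:         {ATHLON_MOPS_PER_CLOCK} Mops/clock")
```

Taking the minimum across stages is the design choice that matters here: the 4-wide dispatch stage can drain bursts, but over time it cannot be fed faster than the trace cache's 3 uops per clock.
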
The Athlon has 9 execution units, which matches the Netburst core if you count each double-pumped ALU as two virtual ALUs (4 virtual ALUs, plus the single-pumped ALU, 2 load/store units, and 2 floating point units). Each Mop can be converted into at most 3 smaller uops once it gets to the dispatch unit, in order to feed all 9 execution units. However, a smaller out-of-order window and the fact that few Mops split into more than 1 or 2 uops prevent this from being a significant advantage over the Netburst core, except for very favorable instruction streams (rendering is one such stream).

I hope this answers your question. I have done extensive research on this topic, and would be happy to answer any questions you may have.

wanna_bmw