To: Dan3 who wrote (53515) 8/31/2001 9:28:54 PM
From: wanna_bmw

Dan, Re: "P4 can issue 4 mixed integer and FP operations per cycle, plus one load and one store for a total of 6."

No, Dan. You're confused. I wonder how many times I have to teach you about the Netburst micro-architecture before you get the idea that I know what I'm talking about.

First, check out this link. It will help you: developer.intel.com

Check out page 39 (1-17), and you'll see that the Pentium 4 has four issue ports going to seven execution units. You are confusing the issue rate of the dispatch unit with the actual execution resources. If you issue 4 ALU instructions (2 per clock per ALU port) and 2 load/store instructions, you can get 6 uops executed in one clock, as you say. But this all has to do with dispatching, not the number of execution units.

Since not all instructions have an execution latency of 1 cycle, the dispatcher has to find an execution unit that is unoccupied. Also, not every execution unit can accept every uop. Some units are optimized to accept only the fastest uops. That is why there is a slow ALU, in addition to the two fast ALUs, to accommodate the Pentium 4's slower operations, such as shifts. The slow ALU still runs at the core frequency, though.

If you want to get technical, you can also look at the FP Execute execution unit, and see that it is composed of 7 simple pipelines: one that handles FP_ADD, and the others that handle FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, and MMX_MISC. This way, a mixed uop stream is less likely to find the pipeline it needs busy with a long-latency instruction. Technically, all of these are execution resources, but since the dispatcher can only access one at a time, they are all bundled under one label. So I am not incorrect to say that there are 7 execution units.
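To put the ports-versus-units distinction in concrete terms, here is a rough Python sketch. The tallies (four ports, two uops per clock on each ALU port, one each for load and store, seven units) come from the description above. The exact assignment of units to ports, and names such as PORTS, fp_move, max_uops_per_cycle and execution_unit_count, are assumptions made for illustration rather than something quoted from the Intel manual.

```python
# Illustrative model only. The port-to-unit layout below is an assumption
# pieced together from the post's description, not an authoritative map.
PORTS = {
    # Each ALU port can accept 2 uops per clock (fast ALUs are double-pumped).
    "port0": {"uops_per_clock": 2, "units": ["fast_alu_0", "fp_move"]},
    "port1": {"uops_per_clock": 2, "units": ["fast_alu_1", "slow_alu", "fp_execute"]},
    # The memory ports accept 1 uop per clock each.
    "port2": {"uops_per_clock": 1, "units": ["load"]},
    "port3": {"uops_per_clock": 1, "units": ["store"]},
}


def max_uops_per_cycle(ports):
    """Peak dispatch rate: the sum of what each port can accept per clock."""
    return sum(port["uops_per_clock"] for port in ports.values())


def execution_unit_count(ports):
    """Number of execution units sitting behind all of the ports."""
    return sum(len(port["units"]) for port in ports.values())


if __name__ == "__main__":
    print(len(PORTS), max_uops_per_cycle(PORTS), execution_unit_count(PORTS))
```

Run as a script, this prints 4 6 7: four ports, a peak of six uops dispatched per clock, and seven execution units. The peak dispatch rate and the unit count are separate numbers, which is the whole point.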
There are 3 ALUs, 2 FPUs, and 2 load/store units, all with slightly differing functionality (not nearly as symmetric as in the Athlon). You would only be correct in saying that the dispatch unit can issue a maximum of 6 uops per cycle, in the mix that you described, not that there are only 6 execution units total.

But when you compare it to Athlon, you are being very misleading. The philosophy behind the Pentium 4 wasn't to anticipate every kind of instruction stream, but rather to optimize around common instruction streams. Athlon took a brute force approach, which is why it performs better on legacy code. But since code keeps changing, and will most likely become more optimized in the future, the argument over the better philosophy is moot. You have no point in comparing Pentium 4 and Athlon issue rates, since both micro-architectures have a slim chance of reaching their maximums, anyway.

wanna_bmw