To: jcholewa who wrote (53595)
9/1/2001 9:29:08 PM
From: wanna_bmw

JC, Re: "I understand that a MacroOp can potentially be made of as low as one Op, but how can you say that three MacroOps (totalling between three and six Ops) per clock averages out to the same as three microOps per clock?"

At the stage of the pipeline where macro-ops enter the K7 back-end, they are not yet multiple micro-ops; each is treated as an individual entity. Only at the dispatch stage can a macro-op split into multiple micro-ops, and my comparison was based strictly on the issue rate of the front-end.

Even when macro-ops do reach the dispatch phase and split into more than one micro-op, there is still the business of checking for data availability, dependencies, and branches. Even if the Athlon processed 3 macro-ops in a given clock, and each of those could be split into 3 micro-ops, it doesn't mean that all 9 micro-ops will be executed at the same time. Due to the lower level of ILP in the x86 architecture, these instructions will usually stall the pipeline, such that a rate of 3 micro-ops per clock is achieved overall. Many studies of ILP in the x86 architecture have confirmed this.

The Athlon, however, was engineered on the philosophy that brute-forcing the issue rate would yield better performance for instruction streams optimized to take advantage of higher ILP. Such is the design philosophy when there is no better way. Critics who say the Athlon is a derivative of the design philosophy of the P6 core are not really that far off. At the time, the best road to performance was to go with a small, efficient OOOE design that handles any kind of instruction stream equally well. Largely, that's what the Athlon offers.

On the other hand, the irregular nature of the Netburst core is due to a finer understanding of the habits of the x86 instruction set. In actuality, there are no more execution units than there need to be.
This was confirmed late last year when an Intel engineer reported that the Pentium 4 had at one time included a third floating point pipeline, but it was removed in pursuit of a smaller, lower-power core, since performance would not suffer much without it (the engineer estimated only a 5% loss from having one fewer floating point unit).

I'm sure there are other examples, but the point you should take away from this is that comparing the specifications of individual internal components in each micro-architecture is not going to yield a definitive apples-to-apples comparison of performance. The two processors were architected using different philosophies, so each is going to behave differently. Each is going to have its bottlenecks in different places, but both were designed for performance by a number of extremely intelligent engineers, and nobody on an enthusiast investment forum (or any forum, for that matter) is going to be able to read the micro-architectural specifications and point out failures in the design.

wanna_bmw