To: Charles Gryba who wrote (53311) 8/30/2001 8:38:43 PM From: wanna_bmw

Constantine, I don't think that Mops are any more valuable than uops in trying to make an apples-to-apples comparison. Intel and AMD simply design their instruction decoding differently:

Intel does instructions -> uops -> traces -> data
AMD does instructions -> Mops -> uops -> data

It's the same methodology, just different steps. Intel uses a trace cache, which aligns decoded uops into bundles called traces, and AMD uses Mops as an intermediate step in converting x86 instructions into RISC-like micro-ops.

It's true that different micro-architectures treat uops differently, but I think you can still compare the two to figure out average throughput, as long as you look at the whole collection of possible bottlenecks, because both micro-architectures have their own.

Just realize that Intel's throughput is bounded by its trace cache issue rate, which averages out to 3 uops per cycle. That's an upper bound, so it's a best case; still, the rest of the processor is optimized to get as close to that number as possible. AMD can issue 3 Mops, which can convert to as many as 9 uops, but that case is rare enough that it should barely be considered. Additionally, the K7's issue rate depends on the instruction-level parallelism (ILP) of the code to produce 3 Mops every cycle, and oftentimes it can't even do that. So even AMD's issue rate averages below 3 Mops overall.

The trace cache, on the other hand, makes Intel's issue rate more independent of the code's ILP, since the traces are formed independently of the pipeline. As long as instruction loops fit within the trace cache, Intel can get very near an issue rate of 3 uops per clock. AMD will get 1, 2, or 3 Mops per clock (more usually 1 or 2), which will probably end up converting to 1 or 2 uops per Mop once they reach the dispatch units.
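That dependence on ILP is easy to see in C. This is a hypothetical illustration, not anything from the post (`sum_serial` and `sum_unrolled` are made-up names): both loops add up the same array, but the first is one long dependency chain, while the second exposes four independent chains that a wide core like the K7 can actually keep its ALUs busy with.

```c
/* Hypothetical sketch of instruction-level parallelism (ILP).
   Both functions compute the same sum; only the dependency
   structure differs. */

long sum_serial(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];          /* each add must wait for the previous one */
    return s;
}

long sum_unrolled(const int *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];         /* four independent chains: a superscalar */
        s1 += a[i + 1];     /* core can issue several of these adds   */
        s2 += a[i + 2];     /* in the same cycle                      */
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Same answer either way; the unrolled version just gives the scheduler something to issue in parallel, which is exactly what "the issue rate depends on the ILP of the code" means.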
I would say that the overall issue rate won't end up much higher than the NetBurst core's, as it will average 1-4 uops executed per clock (similar to the Pentium 4). This is why the two micro-architectures differ so much depending on the application. Optimizing for the NetBurst core means coding small, tight loops that can fit in the trace cache. Optimizing for the Athlon means choosing instructions that commonly convert to a larger number of uops. Optimizing for both means at least paying close attention to data arrays, making sure they fit in cache, and choosing instructions that are low latency for the processor to execute.

wanna_bmw
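P.S. The "make your data arrays fit in cache" advice might look like this in C. This is a hypothetical sketch (`BLOCK`, `two_passes`, and `blocked` are assumed names, and the block size would be tuned to the actual cache on either chip): instead of making two full passes over a large array, where the second pass re-loads data the first pass already evicted, you process it in cache-sized blocks.

```c
#include <stddef.h>

#define BLOCK 4096   /* elements per block; assumed value, tune to cache size */

/* Two full passes: by the time pass 2 touches a[i], it has likely
   been evicted from the data cache if the array is large. */
void two_passes(float *a, size_t n) {
    for (size_t i = 0; i < n; i++) a[i] *= 2.0f;   /* pass 1 */
    for (size_t i = 0; i < n; i++) a[i] += 1.0f;   /* pass 2 */
}

/* Same arithmetic, but both passes run over one cache-sized block
   before moving on, so the second pass hits data that is still hot. */
void blocked(float *a, size_t n) {
    for (size_t b = 0; b < n; b += BLOCK) {
        size_t end = (b + BLOCK < n) ? b + BLOCK : n;
        for (size_t i = b; i < end; i++) a[i] *= 2.0f;
        for (size_t i = b; i < end; i++) a[i] += 1.0f;
    }
}
```

The results are identical; only the memory-access pattern changes, which is the whole point of optimizing the data layout rather than the instruction mix.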