Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Dan3 who wrote (53515) 8/31/2001 8:53:53 PM
From: AK2004
 
<deleted>



To: Dan3 who wrote (53515) 8/31/2001 9:28:54 PM
From: wanna_bmw
 
Dan, Re: "P4 can issue 4 mixed integer and FP operations per cycle, plus one load and one store for a total of 6."

No, Dan. You're confused. I wonder how many times I have to teach you about the Netburst micro-architecture before you get the idea that I know what I'm talking about. First, check out this link. It will help you.

developer.intel.com

Check out page 39 (1-17), and you'll see that the Pentium 4 has four issue ports going to seven execution units. You are confusing the issue rate of the dispatch unit with actual execution resources. If you issue 4 ALU instructions (2 per clock per ALU port) and 2 load/store instructions, you can get 6 uops executed in one clock, as you say. But this all has to do with dispatching, not the number of execution units.
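To make the arithmetic concrete, here is a minimal sketch of the dispatch limit described above. It assumes only what the post states: two ALU ports that can each dispatch 2 uops per clock, plus one load port and one store port. The port names and the `ready` counts are made up for illustration and are not Intel's terminology.

```python
# Per-clock dispatch capacity of each issue port, as described in the post.
# Port names are hypothetical labels, not Intel's.
PORT_CAPACITY = {
    "alu_port_0": 2,  # ALU port, 2 uops per clock
    "alu_port_1": 2,  # ALU port, 2 uops per clock
    "load_port": 1,
    "store_port": 1,
}


def peak_dispatch_per_clock(ports):
    """Upper bound on uops dispatched in one clock across all ports."""
    return sum(ports.values())


def dispatched_this_clock(ready, ports):
    """Uops actually dispatched, given how many are ready for each port.

    `ready` maps a port name to the number of uops waiting on that port.
    Each port is capped at its own capacity, so spare capacity on one
    port cannot be used by uops that need a different port.
    """
    return sum(min(ready.get(port, 0), cap) for port, cap in ports.items())


if __name__ == "__main__":
    print(peak_dispatch_per_clock(PORT_CAPACITY))
    print(dispatched_this_clock({"alu_port_0": 5, "load_port": 1}, PORT_CAPACITY))
```

The peak is a property of the ports, not of the execution units behind them, which is the distinction being drawn here.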

Since not all instructions have an execution latency of 1 cycle, it becomes necessary for the dispatcher to find an execution unit that is unoccupied. Also, not every execution unit can accommodate every instruction uop. Some units are optimized to only accept the fastest instruction uops. That is why there is a slow ALU in addition to the two fast ALUs: it handles the Pentium 4's slower operations, such as shifts. The slow ALU still runs at the core frequency, though.
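A toy model of that dispatcher behavior might look like the following. It is a deliberate simplification, not a description of the real hardware: each unit is treated as fully occupied for the whole latency of its current uop, and the uop kinds, accept sets, and latencies are assumptions picked for the example.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    accepts: frozenset   # uop kinds this unit can execute (illustrative)
    busy_until: int = 0  # first cycle at which the unit is free again


def make_units():
    # Two fast ALUs and one slow ALU, per the post. The accept sets are
    # assumptions chosen for illustration only.
    return [
        Unit("fast_alu_0", frozenset({"add"})),
        Unit("fast_alu_1", frozenset({"add"})),
        Unit("slow_alu", frozenset({"shift"})),
    ]


def dispatch(kind, latency, cycle, units):
    """Send one uop to the first free unit that accepts it.

    Returns the unit's name, or None if the uop has to wait this cycle.
    """
    for unit in units:
        if kind in unit.accepts and unit.busy_until <= cycle:
            unit.busy_until = cycle + latency
            return unit.name
    return None
```

With a hypothetical 4-cycle shift, a second shift arriving one cycle later gets `None` back and has to wait, even though both fast ALUs may be idle, because neither of them accepts that uop.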

If you want to get technical, you can also look at the FP Execute execution unit, and see that it is composed of 7 simple pipelines, one that handles FP_ADD, and the others that handle FP_MUL, FP_DIV, FP_MISC, MMX_SHFT, MMX_ALU, MMX_MISC. In this way, differing instruction uop streams won't find as many of the pipelines busy calculating long instructions. Technically, all these are execution resources, but since the dispatcher can only access one at a time, they are all bundled under one label.

So I am not incorrect to say that there are 7 execution units. There are 3 ALUs, 2 FPUs, and 2 load/store units, all with slightly differing functionality (not nearly as symmetric as in the Athlon). You would only be correct in saying that the dispatch unit can issue a maximum of 6 uops per cycle, in the order that you described, not that there are only 6 execution units total.

But when you compare it to Athlon, you are being very misleading. The philosophy behind the Pentium 4 wasn't to anticipate every kind of instruction stream, but rather to optimize around common instruction streams. Athlon took a brute force approach, which is why it performs better on legacy code. But since code is regularly changing, and will most likely become more optimized in the future, the argument over which philosophy is better is moot. You have no basis for comparing Pentium 4 and Athlon issue rates, since both micro-architectures have a slim chance of reaching their maximums anyway.

wanna_bmw