Politics : Formerly About Advanced Micro Devices


To: milo_morai who wrote (103084) 4/8/2000 1:58:00 PM
From: Dan3
 
Interesting discussion of Willamette vs. Athlon. It's still not clear how much of an advantage (if any) Willamette will have over Athlon. Remember, the guys in this discussion who are most convinced that Willamette will be faster than Athlon were certain that Athlon would slaughter Coppermine clock for clock. The discussion starts here:
aceshardware.com

with:

First, my congratulations to Paul. A very informative article.

But I think he left some points aside.
I want to mention a few that are important to me:

1) Anything coming in must come out.
Given that Willi could issue 6 uops each cycle (which I doubt, going by the IDF presentation slides; they clearly show 3 uops all the way from TC to dispatcher), these 6 uops have to be executed. OK, what has Willi got:
INT: virtually 4 ALUs, 1 load, 1 store
FP: 1 SSE2-enabled FP unit, 1 FP load/store unit

INT: I see a very asymmetric approach here. Only 1 load and 1 store unit? This proved to be a bottleneck in my attempts to push some uops through Willi's execution pipeline.
FP: Intel stresses SSE2, and it is clear why. Willi's ability to execute normal FP code is limited. There is only one adder and multiply unit, so effective throughput for DSP-like code is one FP instruction per clock. Loads and stores are handled well enough by the extra FP load/store unit.
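The "anything coming in must come out" argument can be sketched as a simple bound: sustained uops per clock cannot exceed the issue width, nor what the scarcest unit class can absorb. The Python below is only an illustration. The function name and the 50/30/20 instruction mix are assumptions made up for the example; the unit counts are the ones listed above.

```python
# Upper bound on sustained uops per clock for a given instruction mix.
# Unit counts come from the post; the mix and all names are hypothetical.

def sustained_uops_bound(issue_width, units, mix):
    # A unit class with units[c] units can take at most units[c] uops per
    # clock, so a stream in which a fraction mix[c] of the uops needs class c
    # cannot run faster than units[c] / mix[c] uops per clock overall.
    per_class = [units[c] / share for c, share in mix.items() if share > 0]
    return min([issue_width] + per_class)

willi_units = {"alu": 4, "load": 1, "store": 1}          # INT side, as listed in the post
example_mix = {"alu": 0.5, "load": 0.3, "store": 0.2}    # assumed mix, illustration only

bound_6_wide = sustained_uops_bound(6, willi_units, example_mix)
bound_3_wide = sustained_uops_bound(3, willi_units, example_mix)
print(f"6-wide issue: at most {bound_6_wide:.2f} uops/clock")
print(f"3-wide issue: at most {bound_3_wide:.2f} uops/clock")
```

With this assumed mix the single load unit, not the issue width, limits the 6-wide case, which is the kind of bottleneck described above. A different mix would move the limit elsewhere.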

2) The trace cache stores uops. Paul said in his article that one x86 instruction can decode to as many as 6 uops, or 3.5 uops on average. As said above, Intel's IDF slides showed an issue width of 3 uops. This would be enough to feed the reorder buffer, given the limitations above and the fact that not every pair of ALU instructions can execute 0.5 clocks apart, because the execution stages need to stall from time to time. Given that a 90KB Trace Cache equals a 16KB I-Cache in the PIII in instruction locality, I'm a bit disappointed.

3) Let's compare all this to Athlon's design.

-Point 1. Athlon has a symmetric approach.
INT: 3 ALUs, 3 Address Generation Units (load/store)
So 1 ALU fewer than Willi, but one more load/store unit.
FP: 2 FP execution units (multiply/add etc.), 1 load/store/misc
(OK, there are some limitations regarding pairing of multiplies)

Athlon's single-precision FP peak is 2*2 = 4 SP numbers per clock, the same as Willi's 1*4 = 4 SP numbers per clock.
Athlon's double-precision peak is 2*1 = 2 DP numbers per clock.
Willi's is 1*1 = 1 DP number per clock using normal code, or 1*2 = 2 DP numbers per clock using SSE2.
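The peak figures above are simple products, and a few lines of Python make them easy to check. Reading each product as "number of FP execution units times results per unit per clock" is my interpretation of the post, and the function and dictionary names are hypothetical; the numbers themselves are the ones quoted above, not measurements.

```python
# Peak FP results per clock = FP execution units * results per unit per clock.
# Figures are those quoted in the post; names are illustrative only.

def peak_per_clock(units, results_per_unit):
    return units * results_per_unit

peaks = {
    "Athlon SP": peak_per_clock(2, 2),
    "Willi SP": peak_per_clock(1, 4),
    "Athlon DP": peak_per_clock(2, 1),
    "Willi DP (normal code)": peak_per_clock(1, 1),
    "Willi DP (SSE2)": peak_per_clock(1, 2),
}

for name, value in peaks.items():
    print(f"{name}: {value} per clock")
```

On these numbers the two designs tie on single precision, and Willi only matches Athlon's double-precision peak when the code uses SSE2.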

-Point 2. Direct Path instructions in the Athlon decode to 1-2 MacroOps (uops). This is less than Willi's 3.5. So the ratio of issue width to uops per x86 instruction is:
Athlon 3/1.5 = 2
Willi 3/3.5 = about 0.9 (roughly 1), or 6/3.5 = about 1.7 (roughly 2)
Also, Athlon has a 64KB I-Cache (effectively growing with the exclusive caches in Thunderbird/Spitfire) versus a TC equal in locality to a 16KB I-Cache. Watch out for code that exceeds this 16KB size. It will perform poorly on Willi, because the whole pipeline starves until the next subroutine is in the I-cache.
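The issue-width ratio in Point 2 can be worked through the same way. This sketch divides issue width (uops per clock) by average uops per x86 instruction to get x86 instructions per clock; the function name is hypothetical, 1.5 is the midpoint of the 1-2 MacroOps range the post uses, and 3.5 is the average quoted from Paul's article.

```python
# x86 instructions issued per clock = issue width / average uops per instruction.
# Inputs are the post's figures; the helper name is illustrative only.

def x86_per_clock(issue_width, uops_per_instruction):
    return issue_width / uops_per_instruction

athlon = x86_per_clock(3, 1.5)
willi_3_wide = x86_per_clock(3, 3.5)   # issue width shown on the IDF slides
willi_6_wide = x86_per_clock(6, 3.5)   # the claimed 6-uop issue

print(f"Athlon:         {athlon:.2f} x86 instructions/clock")
print(f"Willi (3-wide): {willi_3_wide:.2f} x86 instructions/clock")
print(f"Willi (6-wide): {willi_6_wide:.2f} x86 instructions/clock")
```

The exact quotients for Willi come out a little below the rounded 1 and 2, so even with 6-wide issue Willi does not quite reach Athlon's ratio on these figures.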

Now, what makes Willi a 7th generation CPU in a way that Athlon is not? Is Willi more 7th generation just because he's got some fancy features like Trace Cache and double pumping?

comments welcomed

Matthias