Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Ali Chen who wrote (227214) | 3/2/2007 5:08:02 AM
From: Petz
 
Nonsense. Amato never said 3.60x; he said 3.2x as the speed improvement for pure HPC applications. I never heard him say that was for SPECfp_rate, so who knows where that non-factoid came from. But thanks for the link; he was misquoted all over the place.

And, by the way, you are forgetting that better prefetch can significantly improve floating-point performance. The bandwidth to the L2 cache doubled, so it is mostly high-memory-bandwidth code, where all four cores' L2s can't be kept filled, that can't meet the 3.2x factor in pure HPC-type stuff.

Petz



To: Ali Chen who wrote (227214) | 3/2/2007 7:11:17 AM
From: Rink
 
re: x86 ISA implementations has saturated, all hanging fruits of speedups are already implemented.

Ali, the x86 ISA is about to be expanded with general-purpose unified-shader functions to be integrated in Fusion. The extensions to the ISA will probably become public 1-2 years before Fusion becomes available (as was the case with x86-64 / AMD64).

Every new process node brings new low-hanging-fruit opportunities for speedups, such as K10 and Fusion.

Your generalization is a bit too simplistic.

Regards,

Rink



To: Ali Chen who wrote (227214) | 3/2/2007 8:52:23 AM
From: DDB_WO
 
Ali - Amato said 3.2X *)

You made a good point about IPC:
Speaking about theoretical constructions, I remember AMD presentations of K6 capabilities to run 4 instructions per clock. In reality, this number rarely exceeded something like 0.2. This is a typical range of disconnect between an architect's world and reality. I think the statements from AMD marketing officials are very irresponsible.

This is what most people seem to miss: IPC is not the maximum sustainable throughput of a CPU's decode/execute/retire pipeline, but of the system as a whole. If there are too many cache misses, mispredicted branches, etc., a wonderful assumed IPC of 3 or 4 might drop sharply. I remember Mitch Alsup stating an (assumed, but likely close to the mark) IPC of 1.0 for Opteron in a Usenet discussion.

E.g. (for everyone else, not you, since you understand this concept): if a chunk of code (say 10 instructions) relies on some data which isn't in the L1, the code stalls for e.g. 14 cycles before it can continue. If execution has a throughput of 3 IPC, then this sequence takes 14+ceil(10/3)=18 cycles on the 3 IPC machine, versus 14+ceil(10/4)=17 cycles on the 4 IPC machine.

That's the reason why OOO loads, improved prefetchers (plus those in the NB), the 32-way L3 cache, separate memory channels and so on are important in this equation, while things like LZCNT/POPCNT won't do anything for Barcelona in the first quarters.

BTW, Scientia from AMDZone also listened to this interview (completely, while I didn't) and found:
One interesting comment was that Intel's current memory access latency is 100ns while AMD's is 70ns and he suggested that AMD will be at 55ns soon. I assume he was talking about K10. I'm now wondering where the improvement will come from. I wonder if this could be related to the new direct to L1 prefetch versus the older load to L2 prefetch.

Here I just think that 15ns (30+ cycles) would be too much to attribute merely to skipping the L2 access for prefetched data.

*) At 1/3 of the full length of the video interview.



To: Ali Chen who wrote (227214) | 3/2/2007 5:37:39 PM
From: pgerassi
 
Dear Ali:

Obviously, to achieve full utilization of 8 FP units, one needs to have a compiler that automatically produces parallel code, which I doubt that AMD has one. Also take into account that the statement was "at the same clock rate".

Haven't you been listening? Sun's Studio C compiler has the Autopar option, which can split a single thread into multiple subthreads, perform the operation on multiple parallel cores, and combine the results. So 3.6 times at the same clock rate is possible with OOO, twice-as-wide load and store bandwidths, dual sets of FPUs per core, and twice the cores. The only question is how much the clock speed has to decline to stay within the same TDPmax.

For previous generations that was about 1.5 speed bins, or 300MHz. With a 2.6GHz 65nm Brisbane, that puts it at 2.3GHz. Of course, that is on an early 65nm process. Later 65nm SOI should get towards 3.3GHz for dual-core, so 3.0GHz for quad-core is possible. But talk of going early to 45nm high-k metal-gate SOI might reduce those clocks when 65nm SOI EOLs.