Ali - Amato said 3.2X *)
You made a good point about IPC: speaking of theoretical constructions, I remember AMD presentations touting the K6's ability to run 4 instructions per clock. In reality, this number rarely exceeded something like 0.2. This is a typical disconnect between an architect's world and reality. I think such statements from AMD marketing officials are very irresponsible.
This is what most people seem to miss: IPC is not the max. sustainable throughput of a CPU's decoding/execution/retirement alone, but of the system as a whole. If there are too many cache misses, mispredicted branches etc., a wonderful assumed IPC of 3 or 4 can drop sharply. I remember Mitch Alsup stating an (assumed, but likely close to the mark) IPC of 1.0 for the Opteron in a usenet discussion.
E.g. (for everyone else, not you, since you understand this concept): if a bunch of code (say 10 instructions) depends on some data which isn't in the L1, the code stalls for e.g. 14 cycles until it can continue. With an execution throughput of 3 IPC, this sequence takes 14+ceil(10/3)=18 cycles on the 3 IPC machine, versus 14+ceil(10/4)=17 cycles on the 4 IPC machine.
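The arithmetic above can be sketched quickly; the 10 instructions and 14-cycle stall are just the post's own illustrative numbers, not measurements:

```python
import math

def total_cycles(n_instr, peak_ipc, stall_cycles):
    """Cycles for a block that first waits out a cache-miss stall,
    then executes n_instr instructions at the machine's peak IPC."""
    return stall_cycles + math.ceil(n_instr / peak_ipc)

for ipc in (3, 4):
    total = total_cycles(10, ipc, 14)
    # Effective IPC = work done / total time, dragged down by the stall.
    print(f"peak IPC {ipc}: {total} cycles, effective IPC {10 / total:.2f}")
```

Note how the effective IPC ends up well below 1 in both cases, which is exactly the point: the stall, not the peak decode width, dominates.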
That's the reason why OOO loads, improved prefetchers (including those in the NB), the 32-way L3 cache, separate memory channels and so on are important in this equation, while things like LZCNT/POPCNT won't do anything for Barcelona in the first quarters.
BTW, Scientia from AMDZone listened to this interview in full (while I didn't) and noted one interesting comment: Intel's current memory access latency is 100 ns while AMD's is 70 ns, and he suggested AMD will be at 55 ns soon. I assume he was talking about K10, and I'm wondering where that improvement will come from. Perhaps it's related to the new direct-to-L1 prefetch versus the older load-to-L2 prefetch.
Here I just think that 15 ns (30+ cycles) would be too much of a gain to come merely from skipping the L2 access for prefetched data.
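To put those latency figures in cycle terms: assuming, say, a 2 GHz core clock (a hypothetical value I'm picking because it makes 15 ns come out to the "30+ cycles" mentioned above), the conversion is simply nanoseconds times clocks per nanosecond:

```python
def latency_in_cycles(latency_ns, clock_ghz):
    # 1 GHz = 1 cycle per ns, so cycles = ns * GHz.
    return latency_ns * clock_ghz

# The quoted figures: Intel 100 ns, AMD 70 ns, a claimed future 55 ns,
# and the 15 ns difference discussed above. Clock speed is assumed.
for ns in (100, 70, 55, 15):
    print(f"{ns} ns = {latency_in_cycles(ns, 2.0):.0f} cycles at 2 GHz")
```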
*) At 1/3 of full length of the video interview. |