"And what is the basis for your premise, given the following statement by AMD:
The quad-core chip also will outperform AMD's current dual-core Opterons on "floating point" mathematical calculations by a factor of 3.6 at the same clock rate, he said."
This statement has to be taken literally, which means it does not refer to the whole of SPECfp2006. If you listen to the tape at syndrome-oc.net you will find the arithmetic behind this statement. First, Amato assumes that the addition of two extra FP units (I think the baseline is a 2-core chip) will double performance, but then he trims that down to 1.8X due to a "potential memory bottleneck". Then he blatantly multiplies 1.8 by 2, arriving at 3.6. This could be true only on certain limited loops, which, as experience shows, constitute maybe a few percent of the total run time of a realistic workload. By "realistic workload" I mean a generally accepted HPC benchmark such as SPECfp/int, and certainly not workstation-irrelevant metrics such as "rate". Obviously, to achieve full utilization of 8 FP units one needs a compiler that automatically produces parallel code, and I doubt that AMD has one. Also take into account that the statement was "at the same clock rate". We have been told on numerous occasions that quad cores must run at reduced frequency to fit into a reasonable thermal envelope, with a corresponding penalty to peak performance.
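To make the arithmetic explicit, here is a minimal sketch in C of both the marketing calculation and what Amdahl's law does to it once you assume only a small fraction of the run time actually benefits; the 5% fraction is my own illustrative assumption, not a measured figure:

    /* A sketch of the arithmetic described above, plus what Amdahl's law
       says if only a fraction p of the run time actually speeds up.
       p = 0.05 is an illustrative assumption, not a measured value. */
    #include <stdio.h>

    int main(void) {
        double fp_widening  = 2.0;   /* two extra FP units: claimed 2X */
        double mem_discount = 0.9;   /* trimmed for the "memory bottleneck" */
        double per_core     = fp_widening * mem_discount;  /* 1.8X */
        double chip         = per_core * 2.0;  /* times two cores: 3.6X */
        printf("marketing speedup: %.1fX\n", chip);

        /* Amdahl's law: overall = 1 / ((1 - p) + p / s) */
        double p = 0.05;
        double overall = 1.0 / ((1.0 - p) + p / chip);
        printf("overall speedup at p = %.2f: %.2fX\n", p, overall);
        return 0;
    }

At p = 0.05 the overall speedup comes out to about 1.04X, i.e. less than 4%, and even a generous 10% fraction would not change the picture much. That is before any clock-rate penalty is applied.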
Speaking of theoretical constructions, I remember AMD presentations on the K6's ability to run 4 instructions per clock. In reality, that number rarely exceeded something like 0.2. This is the typical range of disconnect between an architect's world and reality. I think the statements from AMD marketing officials are very irresponsible.
Consider also the following. As was reported (and ridiculed), AMD demonstrated a Barcelona A0 running some system manager, apparently under the Windows operating system. People need to realize that getting to that point without a BSOD is a tremendous achievement in itself; the CPU must be capable of executing many billions of instructions without a single error. Obviously, the demo system should have been capable of running at least some simple benchmarks like Sandra, or the primitive (but highly reputable) STREAM. So the question is: why was not a single number from real code reported? Maybe because the real performance didn't live up to initial expectations and failed to deliver anything faster than the existing cores? That makes the marketing statements even more irresponsible.
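For reference, the core of STREAM is nothing more exotic than a handful of loops over large arrays. The triad kernel below is a minimal sketch, with the array size and scalar being my own arbitrary choices rather than the official benchmark parameters; there is no excuse for a working demo system not to run something this simple:

    /* Minimal sketch of a STREAM-style triad kernel. N and the scalar
       are arbitrary illustration values, not the official setup. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 10000000

    int main(void) {
        double *a = malloc(N * sizeof *a);
        double *b = malloc(N * sizeof *b);
        double *c = malloc(N * sizeof *c);
        if (!a || !b || !c) return 1;
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        double scalar = 3.0;
        for (long i = 0; i < N; i++)   /* the triad: a = b + scalar*c */
            a[i] = b[i] + scalar * c[i];

        printf("a[0] = %f\n", a[0]);   /* keep the loop from being optimized away */
        free(a); free(b); free(c);
        return 0;
    }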
IMEO, it is impossible to achieve a practical speedup of 40% per core without reducing miss penalties through a fundamental increase in cache sizes and prefetching technology in compilers, and especially without a major redesign of the pipeline. The art of x86 ISA implementation has saturated: all the low-hanging fruit of speedups has already been picked, and nitpicking one or two clocks here and there will not help performance under realistic cache pollution.
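A hypothetical effective-CPI calculation shows why; all the numbers below are assumed for illustration only, not measurements of any real chip:

    /* Illustration of why shaving a cycle or two in the pipeline barely
       moves the needle once cache misses dominate. All values assumed. */
    #include <stdio.h>

    int main(void) {
        double base_cpi     = 1.0;    /* ideal cycles per instruction */
        double miss_rate    = 0.02;   /* misses per instruction */
        double miss_penalty = 200.0;  /* cycles to memory */

        double cpi  = base_cpi + miss_rate * miss_penalty;          /* 5.0 */
        double cpi2 = (base_cpi - 0.2) + miss_rate * miss_penalty;  /* pipeline "win" */
        printf("effective CPI: %.2f -> %.2f (%.1f%% faster)\n",
               cpi, cpi2, 100.0 * (cpi / cpi2 - 1.0));

        /* Halving the miss rate (bigger caches, better prefetching)
           is worth far more than any pipeline nitpicking. */
        double cpi3 = base_cpi + (miss_rate / 2.0) * miss_penalty;
        printf("halved miss rate: %.2f (%.1f%% faster)\n",
               cpi3, 100.0 * (cpi / cpi3 - 1.0));
        return 0;
    }

With these assumed numbers, a 0.2-cycle pipeline improvement buys about 4%, while halving the miss rate buys about 67%. That is the whole argument in two printf lines.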
Sorry John, but this is reality, and there is no reason to shoot the messenger. With this kind of animosity and inflated expectations, you folks are set up for a big disappointment.
- Ali