I am sure things are much more complicated than that, but the bottom line is that the reality test of the embedded controller concept did not reveal tremendous superiority of this approach, and it was successfully countered by Intel with bigger caches, hardware prefetches, and processor-specific compiler improvements.
Well, that about sums it up, and I'd agree: It's not "tremendously" superior, just "somewhat" superior... In a 1S system.
The margin of superiority is small enough that "bigger caches, hardware prefetches, and processor-specific compiler improvements" are enough to overcome it... In a 1S system.
Clearly, the combination of IMC and DCA has demonstrated actual "tremendous superiority" as system scale increases. It takes a lot more (a specialized dual-FSB chipset and motherboard) to overcome the native capabilities of a 2S Opteron system.
In 4S, forget it.
You've got it backwards, IMHO; AMD doesn't need Intel-like caches to catch Intel, Intel needs Intel-like caches to catch AMD (and it only works in small-scale systems, where AMD's IMC provides it the least benefit.)
I think AMD's L3 victim cache will help them in all the right ways, improving single-thread performance, which is where Intel's big shared caches give Intel the most advantage.
And then there's ZRAM hanging out somewhere over the horizon...
fpg