Thanks for the foils, Rob. A very good look at the Alpha 21364 (if you don't consider the obvious anti-Itanium FUD at the beginning).
By the way, I think this statement is incorrect:
<Why 16 as a choice? Most hops are 2. When you add up the latency for those 2 hops you are still doing much better than an L3 hit and of course main memory.>
You probably meant "L2 hit" instead of "L3 hit." But that's not the point.
The latency to the local L2 is always going to be lower than the latency of going to another processor's L2. I think you're looking at slide 18 of the presentation, specifically the statement "15ns processor-to-processor latency." That figure, I think, would be in addition to the latency of the actual L2 access. Furthermore, it probably doesn't account for the extra queuing delay under heavy P2P traffic, which is highly variable but still has to be considered.
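Just to make the arithmetic concrete, here's a back-of-envelope sketch. The 15 ns hop figure is the one from slide 18; the local L2 hit latency and the 2-hop case are numbers I made up purely for illustration:

/* Back-of-envelope sketch of the additive-latency point above.
   The 15 ns hop figure is from slide 18; the 12 ns local L2 hit
   and the 2-hop case are assumptions, only for illustration. */
#include <stdio.h>

int main(void)
{
    double local_l2_ns = 12.0;   /* assumed local L2 hit latency */
    double hop_ns      = 15.0;   /* processor-to-processor latency (slide 18) */
    int    hops        = 2;      /* typical hop count quoted in the foils */

    /* A remote access pays the interconnect hops on top of the L2 hit,
       and that's before any queuing under heavy P2P traffic. */
    double remote_l2_ns = local_l2_ns + hops * hop_ns;

    printf("local L2:  %.0f ns\n", local_l2_ns);
    printf("remote L2: %.0f ns (plus contention)\n", remote_l2_ns);
    return 0;
}

Even with generous numbers, the remote access costs a multiple of the local one before you add any load on the interconnect.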
True, the latency of a remote access is much better than in traditional NUMA systems, but it's still NUMA, and therefore the software must still be optimized for it. This is in contrast to POWER4, where all four L2 caches on the gargantuan MP module are very close to each other and will definitely demonstrate better P2P latency than the Alpha 21364. POWER4 will also be UMA, meaning the software doesn't need to be optimized NUMA-style. (Of course, POWER4 has its disadvantages, but hey, doesn't every architecture?)
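For anyone wondering what "optimized NUMA-style" means in practice, here's a minimal sketch using Linux libnuma calls. This is not anything from the 21364 foils, and the node number and buffer size are made up; it's only to illustrate the placement work that a UMA machine spares you:

/* Minimal sketch of NUMA-aware placement, purely illustrative.
   Uses the Linux libnuma API; node number and buffer size are
   arbitrary. Link with -lnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0;             /* arbitrary node for the example */
    size_t size = 1 << 20;    /* 1 MB working set, made up */

    /* Pin the thread to a node and allocate its memory on that same
       node, so misses are serviced locally instead of hopping to a
       remote processor. On a UMA machine, none of this placement
       work buys you anything. */
    numa_run_on_node(node);
    char *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    for (size_t i = 0; i < size; i++)  /* touch the pages locally */
        buf[i] = 0;

    numa_free(buf, size);
    return 0;
}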
Tenchusatsu