Fyo, <But how do you make sure that data that has to be accessed by processor#1 is located in processor#1's memory more than 50% of the time? Or is that just tough luck?>
It's not really tough luck, but more of a load-balancing issue. With fine-grained interleaving, roughly 50% of a processor's memory accesses will go to the remote processor's memory. This hurts average latency, but at least the load is spread evenly across both memory controllers' bandwidth. This is the easier method.
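To make the 50/50 split concrete, here's a rough sketch of how cache-line interleaving across two nodes could work. The 64-byte line size and the way the node is picked are my own assumptions for illustration, not Clawhammer specifics:

/* Fine-grained (cache-line) interleaving across two nodes.
 * Line size and node-selection scheme are assumed, not Clawhammer's. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* assumed interleave granularity */
#define NUM_NODES 2    /* 2-way system */

/* Consecutive cache lines alternate between the two nodes. */
static int owning_node(uint64_t phys_addr)
{
    return (int)((phys_addr / LINE_SIZE) % NUM_NODES);
}

int main(void)
{
    /* A sequential scan touches both nodes equally, so from either
     * processor's point of view, half the accesses are remote. */
    for (uint64_t addr = 0; addr < 8 * LINE_SIZE; addr += LINE_SIZE)
        printf("line at 0x%04llx -> node %d\n",
               (unsigned long long)addr, owning_node(addr));
    return 0;
}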
If the system is configured for coarse-grained interleaving, then the OS can arrange the data so that most of a processor's memory accesses are local. This is a ccNUMA technique commonly used in large systems of 16 processors or more, where the processors and memory are divided up into nodes. In the case of Clawhammer, each processor would count as its own node, since each one has its own local memory controller.
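For a feel of what that OS-level placement looks like from software, here's a minimal sketch using Linux's libnuma API. It's a modern illustration of the general ccNUMA technique, not anything that existed for Clawhammer:

/* ccNUMA-aware placement sketch using libnuma (link with -lnuma). */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Run this thread on node 0 and put its working set there,
     * so most of its memory accesses stay local. */
    numa_run_on_node(0);
    char *buf = numa_alloc_onnode(1 << 20, 0);  /* 1 MB on node 0 */
    if (!buf)
        return 1;

    buf[0] = 42;             /* touch the memory to fault it in */
    numa_free(buf, 1 << 20);
    return 0;
}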
Unfortunately, the OS will need enhancements to support ccNUMA optimization. And besides, the benefit wouldn't be all that great anyway, at least in a 2-way Clawhammer system. That's because the difference in latency between a local and a remote access is just one hop. Compare that to traditional ccNUMA systems, where the gap between local and remote can be huge.
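A quick back-of-the-envelope shows why one hop doesn't buy the OS much. The latency figures here are made up purely for illustration:

/* Hypothetical latencies; real Clawhammer numbers aren't public. */
#include <stdio.h>

int main(void)
{
    double local = 80.0, remote = 120.0;  /* invented ns figures */

    /* Fine-grained interleaving: 50/50 local/remote split. */
    double interleaved = 0.5 * local + 0.5 * remote;  /* 100 ns */

    /* ccNUMA placement: say the OS keeps 90% of accesses local. */
    double numa_aware = 0.9 * local + 0.1 * remote;   /*  84 ns */

    printf("interleaved avg: %.0f ns\n", interleaved);
    printf("NUMA-aware avg:  %.0f ns\n", numa_aware);
    return 0;
}

With only a one-hop penalty, even near-perfect placement shaves maybe 15-20% off the average, whereas on a big ccNUMA box the same trick can cut latency by half or more.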
So a 2-way Clawhammer system will probably forgo ccNUMA optimization. Like I said before, latency will take a hit, but bandwidth will scale nicely, and in servers, bandwidth usually matters more than latency.
Tenchusatsu