Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: graphicsguru who wrote (239604)
8/30/2007 6:11:26 PM
From: combjelly
 
"Are you saying that the corner cases are so rare that from a performance perspective, it doesn't matter if they're really slow?"

Almost. What I am saying is that such cases in a large NUMA system are going to be really slow because there could be a large number of hops between any two sockets. So doing speculative cache coherency risks having a huge number of instructions executed before finding out whether the speculation was wrong. Given that x86 code has a branch every 8 or so instructions (IIRC), the odds favor any speculative cache coherency being wrong several times before even the first notice arrives in a large NUMA system. So the processors could quickly get to the point where they are unwinding all the time.

I think that CSI can support up to 2048 sockets. A typical hierarchical system will group 4 sockets into a local cluster. Beyond that, the rules of the topology come into play. In the case of CSI, that means a max. of 512 clusters. Assuming that the topology is a torus, you have a 32 by 16 grid, or a max. of 18 hops to reach a cluster and a max. of 20 hops in the worst case to get from one socket to another.
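For anyone who wants to play with the numbers, here is a rough sketch of the cluster arithmetic. It assumes a plain 2D torus of clusters with wraparound links in both dimensions and one hop per link. The function names are made up for illustration and nothing here comes from a CSI spec.

```python
# Rough sketch: cluster count and worst-case hop count for a 2D grid of clusters.
# Assumption: one hop per link, shortest-path routing. Names are hypothetical.

def cluster_count(sockets: int, sockets_per_cluster: int) -> int:
    """Number of clusters when sockets are grouped evenly."""
    return sockets // sockets_per_cluster

def torus_diameter(width: int, height: int) -> int:
    """Worst-case hops on a 2D torus (wraparound in both dimensions).
    The farthest node in each ring is half the ring away."""
    return width // 2 + height // 2

def mesh_diameter(width: int, height: int) -> int:
    """Worst-case hops on a 2D mesh with no wraparound (corner to corner)."""
    return (width - 1) + (height - 1)

clusters = cluster_count(2048, 4)   # 512 clusters
width, height = 32, 16              # 32 * 16 == 512

print(clusters)                       # 512
print(torus_diameter(width, height))  # 24
print(mesh_diameter(width, height))   # 46
```

Under that simple model a 32 by 16 torus comes out to 24 inter-cluster hops rather than 18, so the 18 figure presumably rests on a different link layout or way of counting hops. Either way, the point stands: the worst case is a lot of hops.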

In 20 hops a lot can happen. While it should be ok to do speculative cache coherency between the sockets in a cluster, or even between the sockets in a local neighborhood of clusters, doing it system wide is asking for trouble. It is better to just block.
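To put a rough number on "a lot can happen", here is a back-of-the-envelope sketch. The per-hop latency, clock speed, and instructions per cycle below are placeholder guesses, not measured or published figures; only the 20 hops and the branch every 8 or so instructions come from the discussion above.

```python
# Back-of-the-envelope: how much work is at risk while a coherency
# notice crosses the fabric. All default-free inputs are hypothetical.

def speculative_window(hops: int, ns_per_hop: float, clock_ghz: float,
                       ipc: float, instrs_per_branch: float = 8.0):
    """Return (instructions, branches) executed while one notice
    travels `hops` hops one way."""
    latency_ns = hops * ns_per_hop
    cycles = latency_ns * clock_ghz       # 1 GHz == 1 cycle per ns
    instructions = cycles * ipc
    branches = instructions / instrs_per_branch
    return instructions, branches

# Placeholder inputs: 50 ns per hop, 2 GHz core, 1 instruction per cycle.
instrs, branches = speculative_window(hops=20, ns_per_hop=50.0,
                                      clock_ghz=2.0, ipc=1.0)
print(instrs, branches)
```

With those guesses, one worst-case trip covers on the order of a couple thousand instructions and a few hundred branches. Plug in your own latency numbers; the shape of the problem does not change much.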

And different topologies can mean different max. hops. But what can be implemented depends on how many CSI links are available. Given that the granularity of the lanes is a minimum of 5, there is a practical upper limit to the number of CSI links that can fit on a socket.