No, Barcelona doesn't have the same feature. It has a different feature.
A Barcelona core (A) with a cache miss still communicates first with the core (B) whose memory controller owns the memory in question. B then probes the others (C, D, E ...), and they send their results directly to A.
The (small) optimization is that if C, D, E . . . get back to A with their results before B, and the cache line is in the M or O state, then core A in Barcelona realizes that it knows what B is going to say, so it doesn't bother to wait. K8 would have waited.
The reason it's a small optimization is that it takes an unusual situation for B to be the slow one in getting back to A. After all, B knows what it wants to say before C, D, E do, so usually it will be the first to answer. It will be last only when there's a particularly heavy load on the B->A HyperTransport link, or when it is more hops away from A than the others.
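To make the difference concrete, here is a toy Python model of the completion rule described above. It is only a sketch of the argument, not AMD's actual logic: the function name, the arrival-time numbers, and the idea of reducing the probe results to a single reported state are all assumptions made for illustration.

```python
def completion_time(arrivals, home, probed_state, early_complete):
    """Return the time at which requester A can finish handling the miss.

    arrivals:       dict mapping responder name -> arrival time at A
                    (arbitrary units, hypothetical values)
    home:           name of the node whose memory controller owns the line (B)
    probed_state:   MOESI state reported by the probed caches ("M", "O", ...)
    early_complete: True models the Barcelona behaviour described above,
                    False models waiting for every response (K8)
    """
    # Time at which the last of C, D, E ... has answered.
    probes_done = max(t for node, t in arrivals.items() if node != home)
    # Time at which everyone, including the home node B, has answered.
    all_done = max(arrivals.values())

    # If a probed cache holds the line in M or O, A already has the data
    # and knows what B would say, so it need not wait for B.
    if early_complete and probed_state in ("M", "O"):
        return probes_done
    return all_done


# Unusual case: B answers last (congested link or extra hops).
slow_home = {"B": 9, "C": 5, "D": 6}
print(completion_time(slow_home, "B", "M", early_complete=False))  # 9
print(completion_time(slow_home, "B", "M", early_complete=True))   # 6

# Usual case: B answers first, so both rules finish at the same time.
fast_home = {"B": 3, "C": 5, "D": 6}
print(completion_time(fast_home, "B", "M", early_complete=False))  # 6
print(completion_time(fast_home, "B", "M", early_complete=True))   # 6
```

The two rules only differ when B is the last to arrive and the line is in M or O, which is exactly why the gain is small.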
The fact that it's a small optimization is probably why it was not included in K8. Or maybe it gets to be more significant with more cores, and that's the reason they didn't bother with it on K8.
The Intel CSI scheme is very different and much more complicated. In the AMD scheme, serialization always happens at the node that owns the memory. In the Intel scheme, that's not true. So the Intel scheme can require unwinding transactions occasionally to guarantee proper ordering.
The Intel scheme is much more complex, but can potentially yield significant performance advantages.