Coherent Cache: Tench, you don't seem to know much about AMD's MP systems, and it seems like a WAG IMO.
In most SMP systems, the individual CPUs monitor requests on the FSB and return the data if it is present in their own caches. For example, let’s take a dual processor Athlon MP system with two CPUs: CPU0 and CPU1. First, CPU0 requests a block of data that is contained within main memory and not within CPU0’s cache or CPU1’s cache. The data is delivered from main memory, through the North Bridge, up to the CPU that requested it, in this case CPU0.
Then, CPU0 requests another block of data that is located within CPU1’s L2 cache. CPU1 is always monitoring (also called snooping) the FSB for requests for data; this time the data is in its cache, so CPU1 responds. Now there are two ways of getting the data to CPU0: it can either be written to main memory by CPU1 and then read from memory by CPU0, or it can be transferred directly from CPU1 to CPU0.
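To make the snoop sequence concrete, here is a toy Python sketch of it (mine, not the article's); the CPU class and read() helper are made up for illustration and aren't anything AMD-specific.

    # Toy model of the snoop sequence described above: on a read, every other
    # CPU's cache is probed before main memory is consulted. All names here
    # are illustrative, not AMD's actual implementation.

    class CPU:
        def __init__(self, name):
            self.name = name
            self.cache = {}          # address -> data

        def snoop(self, addr):
            """Another CPU's request appears on the FSB; answer if we hold it."""
            return self.cache.get(addr)

    def read(requester, others, memory, addr):
        if addr in requester.cache:                  # cache hit, no bus traffic
            return requester.cache[addr]
        for cpu in others:                           # snoop the other CPUs
            data = cpu.snoop(addr)
            if data is not None:
                # Two possible return paths: via main memory (shared FSB) or
                # directly through the North Bridge (point-to-point FSB).
                requester.cache[addr] = data
                return data
        data = memory[addr]                          # miss everywhere: go to DRAM
        requester.cache[addr] = data
        return data

    memory = {0x100: "block A", 0x200: "block B"}
    cpu0, cpu1 = CPU("CPU0"), CPU("CPU1")
    read(cpu1, [cpu0], memory, 0x200)                # CPU1 pulls block B from DRAM
    print(read(cpu0, [cpu1], memory, 0x200))         # CPU0 gets it from CPU1's cache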
In the case of a Shared Front Side Bus, where all of the CPUs in an MP system share the same connection to the North Bridge, inter-CPU communication must be carried through main memory, which was the first example we gave. In the case of a Point-to-Point Front Side Bus, where each of the CPUs gets its own dedicated path to the North Bridge, inter-CPU communication can occur without going to main memory, simply within the North Bridge.
The Shared FSB and Point-to-Point FSB aren’t functions of the CPU; all the Athlon MP can do is make sure it works with a particular protocol. Instead, this is a chipset function, and in the case of the 760MP, it implements a Point-to-Point bus protocol. This helps reduce memory bus traffic, since all inter-CPU communication occurs without even hitting the memory bus. For comparison’s sake, all MP chipsets for Intel processors use a Shared FSB, including the recently released i860 chipset for the Intel Xeon. It is arguable whether the ability to direct all snooping traffic internally within the North Bridge helps performance; all indications point to this being a feature that is nice to have but not necessarily a performance booster.
Another benefit of the Athlon MP’s EV6 FSB is that every EV6 bus link has two unidirectional address ports (address in and address out) and one bidirectional data port. This means that an Athlon MP can snoop for data it needs while fulfilling a data request at the same time. The Pentium 4’s AGTL+ FSB only has a single bidirectional address port and a single bidirectional data port, meaning that addresses can only be sent to or from the processor one at a time, not simultaneously.
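A rough way to see why the dual-ported address bus matters: with two unidirectional ports, a snoop probe and an outgoing request can move in the same cycle, while one bidirectional port has to take turns. The toy cycle count below is an illustration only; real bus timing is far more complex than this.

    # Toy cycle count contrasting the two bus designs described above.
    # Purely illustrative; not a model of actual EV6 or AGTL+ timing.

    def cycles_needed(inbound_addrs, outbound_addrs, dual_ported):
        if dual_ported:
            # EV6-style: address-in and address-out are separate unidirectional
            # ports, so a snoop probe and a request can move in the same cycle.
            return max(inbound_addrs, outbound_addrs)
        # AGTL+-style: one bidirectional address port, so every address
        # transfer (either direction) takes its own turn on the bus.
        return inbound_addrs + outbound_addrs

    print(cycles_needed(4, 4, dual_ported=True))   # 4 cycles
    print(cycles_needed(4, 4, dual_ported=False))  # 8 cycles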
Taking our Athlon MP system out for another test, we have the following situation: CPU0 has a block of data in its cache, and CPU1 has the same data in its cache. CPU1 then changes the data that both processors have in their caches, after which CPU0 attempts to read that data. At this point the copy of the data stored in CPU0’s cache isn’t the most recent copy; in fact, it has been changed since CPU0 pulled it into its cache. Keeping the data in each CPU’s cache up to date, or coherent with one another, is what we mean when we refer to cache coherency.
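Here is a tiny sketch of exactly that failure, with the caches modeled as plain Python dictionaries; without a coherency protocol, CPU0 happily reads its stale copy.

    # What goes wrong without coherency: both caches start with the same line,
    # CPU1 writes its copy, and CPU0 then reads a stale value. Illustrative only.

    memory = {0x100: 1}
    cache0 = {0x100: memory[0x100]}   # CPU0 pulls the line in
    cache1 = {0x100: memory[0x100]}   # CPU1 pulls the same line in

    cache1[0x100] = 2                 # CPU1 modifies its copy...
    print(cache0[0x100])              # ...and CPU0 still sees 1: stale data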
There are only a couple of major cache coherency protocols, but many variants of them. By far the most common cache coherency protocol is known as write invalidate. Generally speaking, the write invalidate protocol dictates that when a coherency conflict occurs, the stale copies of the data in other processors’ caches are invalidated. The invalidate function is one that takes place over the address bus alone, meaning that the EV6’s dual-ported address bus comes in handy once again, allowing a cache line invalidate and a data request to be executed simultaneously.
There are many forms of the write invalidate coherency protocol, the most common being a MESI protocol. The four-letter acronym stands for the four states (Modified, Exclusive, Shared or Invalid) that a cache line may take. The meanings of the four states are as follows: a Modified line exists in only one cache and has been written to, so main memory no longer holds the current copy; an Exclusive line exists in only one cache and matches main memory; a Shared line may exist in several caches at once and matches main memory; and an Invalid line contains no usable data.
anandtech.com
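To tie the quoted description together, here is a rough Python sketch of a per-line MESI state machine (MESI being a write invalidate protocol). The state names come from the article; the transition details are the textbook ones, not necessarily exactly what the Athlon MP implements.

    # Minimal per-line MESI state machine (a write-invalidate protocol).
    # Textbook transitions for illustration, not AMD's actual implementation.

    M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

    def next_state(state, event, others_have_copy=False):
        if event == "local_read":
            if state == I:
                return S if others_have_copy else E
            return state                       # M/E/S reads hit locally
        if event == "local_write":
            return M                           # any state ends up Modified; S/I
                                               # first send an invalidate on the
                                               # address bus
        if event == "snoop_read":
            return S if state in (M, E) else state  # M also writes data back
        if event == "snoop_invalidate":
            return I                           # another CPU is writing this line
        raise ValueError(event)

    state = next_state(I, "local_read")            # Exclusive: sole owner, clean
    state = next_state(state, "local_write")       # Modified: dirty, memory stale
    state = next_state(state, "snoop_read")        # Shared: another CPU read it
    state = next_state(state, "snoop_invalidate")  # Invalid: another CPU wrote it
    print(state)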
I don't believe you are correct about Hammer latency, as AMD doesn't use the same systems that you seem to be used to. You forget that HT has much higher bandwidth and lower latency than EV6 or GTL+.
tecchannel.de
M.