Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Tenchusatsu who wrote (69576) 1/30/2002 6:24:28 PM
From: combjelly
 
"actually for a 4-way Hammer system, the latency is worse than you think."

An average of 160 ns (for an 8-way) doesn't sound too bad.

platformconference.com
Look at slide 45.



To: Tenchusatsu who wrote (69576) 1/30/2002 6:28:16 PM
From: Joe NYC
 
Tenchusatsu,

"actually for a 4-way Hammer system, the latency is worse than you think. Each time a memory access is initiated, the other processors need to be probed. The access cannot be retired until all of the probes are resolved."

Thanks for the info.

"It's curious why AMD would choose a ring topology instead of a mesh like IBM's POWER4."

It only looks like a ring when there are just 2 HT ports for communication between processors. Three would put every processor in a 4-way system one hop away, and would limit an 8-way system to no more than two hops.
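A quick sketch to check those hop counts. The topologies here are my own guesses rather than AMD's documented layouts; for the 8-way case I use a ring plus opposite-node links, one degree-3 arrangement that gets every node within two hops:

#include <stdio.h>
#include <string.h>

#define MAX_NODES 8

/* Breadth-first search: worst-case hop count from node `src`
 * over an adjacency matrix `adj` of `n` nodes. */
static int max_hops_from(int adj[MAX_NODES][MAX_NODES], int n, int src)
{
    int dist[MAX_NODES], queue[MAX_NODES], head = 0, tail = 0, worst = 0;
    memset(dist, -1, sizeof dist);
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < n; v++)
            if (adj[u][v] && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                if (dist[v] > worst) worst = dist[v];
                queue[tail++] = v;
            }
    }
    return worst;
}

/* Diameter: the worst hop count over all source nodes. */
static int diameter(int adj[MAX_NODES][MAX_NODES], int n)
{
    int d = 0;
    for (int i = 0; i < n; i++) {
        int e = max_hops_from(adj, n, i);
        if (e > d) d = e;
    }
    return d;
}

int main(void)
{
    int ring4[MAX_NODES][MAX_NODES] = {0}; /* 2 links/CPU: ring      */
    int full4[MAX_NODES][MAX_NODES] = {0}; /* 3 links/CPU: full mesh */
    int ml8[MAX_NODES][MAX_NODES]   = {0}; /* 3 links/CPU, 8 nodes   */

    for (int i = 0; i < 4; i++) {
        ring4[i][(i + 1) % 4] = ring4[(i + 1) % 4][i] = 1;
        for (int j = 0; j < 4; j++)
            if (i != j) full4[i][j] = 1;
    }
    /* 8 nodes: each CPU links to its two ring neighbors plus the
     * opposite node -- one degree-3 topology with a 2-hop worst case. */
    for (int i = 0; i < 8; i++) {
        ml8[i][(i + 1) % 8] = ml8[(i + 1) % 8][i] = 1;
        ml8[i][(i + 4) % 8] = ml8[(i + 4) % 8][i] = 1;
    }

    printf("4-way ring (2 HT links):      max %d hops\n", diameter(ring4, 4));
    printf("4-way full mesh (3 HT links): max %d hops\n", diameter(full4, 4));
    printf("8-way, 3 links per CPU:       max %d hops\n", diameter(ml8, 8));
    return 0;
}

It prints max 2, 1 and 2 hops respectively, matching the claim above.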

But then again, I believe 4-way and 8-way Hammer will be cancelled anyway. ;-)

If it were up to me, I would only develop a 4- or 8-way system if there were a partner (OEM) who would sign up to buy such a system. Without a partner, they should concentrate on 2-way systems, or possibly 2-way x 2 cores per CPU.

Joe



To: Tenchusatsu who wrote (69576) 1/30/2002 8:42:38 PM
From: pgerassi
 
Dear Tench:

You are forgetting that a big help comes from the VM portion of the OS. It keeps track of where each page is loaded and thus knows where to go for any given access. The AGUs take a virtual address and find the appropriate hardware address. That address can be split into two fields: the most significant field holds the node ID of the memory controller owning that memory page, and the least significant field is the address to be handed to that memory controller. Since invalid addresses are caught by the AGUs of the requesting processor, no probing needs to be done except at system startup by the OS. Writes in this case are faster than reads, since both the address and data move unidirectionally; in read accesses, the address moves outbound from the local core and the data moves inbound.
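In code, that address split might look something like this. It's a minimal sketch; the field widths and the 40-bit address size are my own assumptions for illustration, not Hammer's actual memory map:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical split of a 40-bit physical address: the top bits name
 * the node whose memory controller owns the page, the rest is the
 * local offset handed to that controller. Widths are illustrative. */
#define NODE_BITS   3                       /* up to 8 nodes          */
#define LOCAL_BITS  37                      /* per-node address space */
#define LOCAL_MASK  ((1ULL << LOCAL_BITS) - 1)

static unsigned node_of(uint64_t paddr)
{
    return (unsigned)((paddr >> LOCAL_BITS) & ((1u << NODE_BITS) - 1));
}

static uint64_t offset_of(uint64_t paddr)
{
    return paddr & LOCAL_MASK;
}

int main(void)
{
    uint64_t paddr = (5ULL << LOCAL_BITS) | 0x1234ULL; /* page on node 5 */
    printf("node %u, local offset 0x%llx\n",
           node_of(paddr), (unsigned long long)offset_of(paddr));
    return 0;
}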

Secondly, the routing of these command (address) and data packets can be determined ahead of time, and can be as simple as a 256-entry (IIRC), 2- or 3-bit-wide table, where the node ID is used as the entry address and the bits show which HT link to use, with 0 being the local port of course.
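Something like this, in code. The link assignments are invented for the example; in reality the table would be filled in at system startup, per the above:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-CPU routing table as described above: indexed by
 * destination node ID, each entry names the HT link to forward on.
 * Link 0 stands for "this node's local memory controller". */
enum { LINK_LOCAL = 0, LINK_HT0 = 1, LINK_HT1 = 2, LINK_HT2 = 3 };

static uint8_t route[256]; /* only 2-3 bits of payload per entry */

static void route_init(void)
{
    /* Filled in once at boot by the OS/firmware; values are made up. */
    route[0] = LINK_LOCAL; /* we are node 0                           */
    route[1] = LINK_HT0;
    route[2] = LINK_HT1;
    route[3] = LINK_HT0;   /* node 3 reached through the node on HT0  */
}

static uint8_t pick_link(uint8_t node_id)
{
    return route[node_id]; /* one table lookup per packet */
}

int main(void)
{
    route_init();
    for (uint8_t n = 0; n < 4; n++)
        printf("packet for node %u -> link %u\n", n, pick_link(n));
    return 0;
}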

Further detail merely follows standard routing practice used in Ethernet switches.

Pete



To: Tenchusatsu who wrote (69576) 1/30/2002 10:05:14 PM
From: milo_morai
 
Coherent Cache - Tench, you don't seem to know much about AMD's MP systems, and it seems like a WAG IMO.

In most SMP systems, the individual CPUs monitor for requests across the FSB and return the data if it is present within the CPU’s cache. For example, let’s take a dual processor Athlon MP system with two Athlon MP CPUs: CPU0 and CPU1. First, CPU0 requests a block of data that is contained within main memory and not within CPU0’s cache or CPU1’s cache. The data is delivered from main memory, through the North Bridge, up to the CPU that requested it, in this case CPU0.

Then, CPU0 requests another block of data that is located within CPU1’s L2 cache. CPU1 is always monitoring (also called snooping) the FSB for requests for data; this time around, the data is in its cache and it sends it out. Now there are two ways of getting the data to CPU0: it can either be written to main memory by CPU1 and read by CPU0, or it can be transferred directly from CPU1 to CPU0.

In the case of a Shared Front Side Bus, where all of the CPUs in an MP system share the same connection to a North Bridge, inter-CPU communication must be carried through main memory, which was the first example we gave. In the case of a Point-to-Point Front Side Bus, where each of the CPUs gets its own dedicated path to the North Bridge, inter-CPU communication can occur without going to main memory, simply within the North Bridge.

The Shared FSB and Point-to-Point FSB aren’t functions of the CPU; all the Athlon MP can do is make sure it works with a particular protocol. Instead, this is a chipset function, and in the case of the 760MP, it implements a Point-to-Point bus protocol. This helps reduce memory bus traffic, since all inter-CPU communication occurs without even hitting the memory bus. For comparison’s sake, all MP chipsets for Intel processors use a Shared FSB, including the recently released i860 chipset for the Intel Xeon. It is arguable whether or not the ability to direct all snooping traffic internally within the North Bridge helps performance; all indications seem to point to this being a feature that is nice to have but not necessarily a performance booster.

Another benefit of the Athlon MP’s EV6 FSB is that there are two unidirectional address ports (address in and address out) and one bidirectional data port in every EV6 bus link. This means that an Athlon MP can snoop for data it needs while fulfilling a data request at the same time. The Pentium 4’s AGTL+ FSB only has a single bidirectional address port and a single bidirectional data port, meaning that addresses can only be sent to or from the processor one at a time, not simultaneously.

Taking our Athlon MP system out for another test, we have the following situation: CPU0 has a block of data in its cache, and CPU1 has the same data in its cache. CPU1 then changes the data that both processors have in their caches after which CPU0 attempts to read that data. At this point the copy of the data stored in CPU0’s cache isn’t the most recent copy; in fact it has been changed since CPU0 pulled it into its cache. Keeping the data in each CPU’s cache up to date, or coherent with one another, is what we mean when we refer to cache coherency.

There are only a couple of major cache coherency protocols, but many variants of them. By far the most common cache coherency protocol is known as write invalidate. Generally speaking, the write invalidate coherency protocol simply dictates which processor’s cache the data is invalidated in when a coherency conflict occurs. The invalidate function takes place over the address bus alone, meaning that the EV6’s dual-ported address bus comes in handy once again, allowing a cache line invalidate and a data request to be executed simultaneously.

There are many forms of the write invalidate coherency protocol, the most common being a MESI protocol. The four-letter acronym stands for the four states (Modified, Exclusive, Shared or Invalid) that a cache line may take. The meanings of the four states are as follows:

Modified - the line is present only in this cache and has been changed, so the copy in main memory is stale.
Exclusive - the line is present only in this cache and matches main memory.
Shared - the line may also be present in other caches and matches main memory.
Invalid - the line holds no valid data.

anandtech.com
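To make the write-invalidate flow concrete, here is a toy MESI model of the CPU0/CPU1 scenario above. This is my own sketch, not code from the article or from AMD; it just walks one cache line through the states:

#include <stdio.h>

/* MESI states for a single cache line as seen by one cache. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

static const char *name[] = { "Invalid", "Shared", "Exclusive", "Modified" };

/* Events seen by this cache: its own CPU's reads/writes, and
 * snooped reads/writes from other CPUs on the bus. */
typedef enum { LOCAL_READ, LOCAL_WRITE, SNOOP_READ, SNOOP_WRITE } event_t;

static mesi_t step(mesi_t s, event_t e, int others_have_line)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)             /* miss: fetch the line         */
            return others_have_line ? SHARED : EXCLUSIVE;
        return s;                     /* hit: state unchanged         */
    case LOCAL_WRITE:
        return MODIFIED;              /* invalidate others over the
                                         address bus, then own it     */
    case SNOOP_READ:
        if (s == MODIFIED || s == EXCLUSIVE)
            return SHARED;            /* supply data, drop to Shared  */
        return s;
    case SNOOP_WRITE:
        return INVALID;               /* another CPU's write kills us */
    }
    return s;
}

int main(void)
{
    /* The article's scenario: both caches hold the line, CPU1 writes
     * it, so CPU0's copy must be invalidated and re-fetched. */
    mesi_t cpu0 = INVALID;
    cpu0 = step(cpu0, LOCAL_READ, 1);  /* CPU1 also has it -> Shared  */
    printf("after CPU0 read:   %s\n", name[cpu0]);
    cpu0 = step(cpu0, SNOOP_WRITE, 1); /* CPU1 writes -> Invalid      */
    printf("after CPU1 write:  %s\n", name[cpu0]);
    cpu0 = step(cpu0, LOCAL_READ, 1);  /* re-read gets the fresh copy */
    printf("after CPU0 reread: %s\n", name[cpu0]);
    return 0;
}

The point of the model is that the invalidate is address-bus-only traffic, which is why the EV6's dual-ported address bus helps here.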

I don't believe you are correct about Hammer latency, as AMD doesn't use the same systems that you seem to be used to. You forget that HT has much higher bandwidth and lower latency than EV6 or GTL+.

tecchannel.de

M.