Technology Stocks : Intel Corporation (INTC)


To: Paul Engel who wrote (162334) 3/15/2002 9:07:01 PM
From: Elmer
 
Paul -

I posted a cheer-up message to NiceGuy on the Mod thread. For some reason he isn't responding to me???

I hope he keeps posting his losers in real time.

EP



To: Paul Engel who wrote (162334) 3/16/2002 8:45:21 AM
From: John F. Dowd
 
Paul: I cited the article on which the commentary was based:

anandtech.com

JFD



To: Paul Engel who wrote (162334) 3/16/2002 3:39:12 PM
From: Dan3
 
Re: "AMDroids think that their salvation is to bump the FSB to 166MHz. There's only one thing wrong-- it makes no difference in performance."

Paul responded -> Are you sure of this? I know the Pentium 4 has shown a significant performance increase in going to 533 MHz FSB coupled with 1066 MHz Rambus memory.

It could be the case. Athlon has a shorter pipeline, cache lines half as long, and a 20-way cache vs. P4's 8-way cache.

Running identical software, Athlon puts much less of a load on its memory subsystem. Current memory technology will allow Athlon performance to scale quite a bit higher than at present, while P4's bandwidth-hungry design is already showing signs of hitting a wall.

Why does P4 put more of a load on its memory than Athlon for a given level of performance?

There are a number of reasons, but the driving force behind the design decision that led to this circumstance is the long pipeline of P4. When any CPU has to wait for data or instructions, a "bubble" is introduced into the pipeline. A nice outline of pipeline hazards is here:
stanford.edu
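
Just to put rough numbers on it (a toy calculation in C with made-up stage counts and stall rates, not actual P4 or Athlon figures), here's the kind of arithmetic involved:

/* Toy model: cost of pipeline bubbles vs. pipeline depth.
   The depths and stall rate below are made-up, illustrative numbers,
   not measured P4/Athlon figures. */
#include <stdio.h>

int main(void)
{
    /* When a hazard forces the pipeline to drain and refill, the
       penalty scales roughly with its depth. */
    int short_pipe = 10;              /* "Athlon-like" depth (assumed)   */
    int long_pipe  = 20;              /* "P4-like" depth (assumed)       */
    double stall_rate = 0.02;         /* assume 2% of instructions stall */

    printf("short pipeline: %.2f cycles/instr\n", 1.0 + stall_rate * short_pipe);
    printf("long  pipeline: %.2f cycles/instr\n", 1.0 + stall_rate * long_pipe);
    return 0;
}

Same stall rate, but the deeper pipe pays roughly twice the price per stall.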

With a pipeline approximately twice as long, P4 stands to lose a lot more from a stall than does Athlon. P4 is further hampered by its limited caching ability, since it can cache reads from only 8 pages for any given LSBs (least significant bits), while Athlon can cache reads from 20 pages with the same LSBs.

What do LSBs have to do with it? Think about what a chip has to do to determine if a read is already in the cache. It has to compare an address with the addresses of cached locations. A fully associative cache does just that: it compares the address in question with every address in the cache - which is a lot of compares, unless the cache is very small. But compares take time and power. A direct-mapped cache is the opposite: there is only one cache location to check for any given memory address. Set associativity, or "wayness", is the number of possible cache locations for any given memory location.
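
In code terms, the lookup trade-off looks roughly like this (a simplified sketch that stores whole addresses as tags and ignores line offsets - not a model of either chip's actual cache):

/* Simplified sketch of the two extremes of cache lookup. */
#include <stdint.h>
#include <stddef.h>

#define NUM_LINES 256

/* Fully associative: any address can live in any line, so a lookup
   has to compare against every stored tag - lots of comparators. */
int fully_assoc_hit(const uint32_t tags[NUM_LINES], uint32_t addr)
{
    for (size_t i = 0; i < NUM_LINES; i++)      /* up to 256 compares */
        if (tags[i] == addr)
            return 1;
    return 0;
}

/* Direct mapped: the low bits of the address pick exactly one line,
   so a lookup is a single compare - but addresses that share those
   low bits all fight over that one line. */
int direct_mapped_hit(const uint32_t tags[NUM_LINES], uint32_t addr)
{
    return tags[addr % NUM_LINES] == addr;      /* 1 compare */
}

An N-way set associative cache sits in between: the low bits pick a set, and only the N tags in that set get compared.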

Why is this done and how does it work?

The cache controller uses the LSBs (least significant bits) of the address to determine which cache locations need to be checked. It uses the end of the address, the last few bits, as an index or hash code. So, if a PC had a 4-way cache with 256 locations, the cache would basically be divided into 4 pieces, each of which could store 64 locations. It would effectively divide main memory addresses into 64 classes, each of which could have, at most, 4 locations cached at any one time.

So for any address ending in ******11 0000, say 01101100 11110000 or 01101100 10110000, there are only 4 cache locations available to store that information. This wouldn't matter all that much, except that memory is generally allocated by the operating system in blocks with the same LSBs, and modern OSs and programs taking advantage of object-oriented design are loaded as many small modules, each of which is allocated a block (or blocks) of memory by the OS - blocks which often start at the same LSBs.
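
Here is that example worked in code (4 ways, 256 locations, so 64 sets; simplified to byte addresses with the line-offset bits ignored):

/* Which set does an address land in, for the 4-way / 256-location
   example above?  256 / 4 = 64 sets, so the low 6 bits pick the set. */
#include <stdio.h>

int main(void)
{
    unsigned ways = 4, lines = 256;
    unsigned sets = lines / ways;           /* 64 sets */

    unsigned a = 0x6CF0;                    /* 01101100 11110000 */
    unsigned b = 0x6CB0;                    /* 01101100 10110000 */

    printf("a -> set %u\n", a % sets);      /* both end in 110000, */
    printf("b -> set %u\n", b % sets);      /* so both land in set 48 */
    return 0;
}

Both addresses fall into the same set, so they compete for the same 4 ways no matter how big the rest of the cache is.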

So, particularly on complex or server applications, P4 (and Xeon, and Xeon MP), even with a very large 8-way cache, is going to be thrashing some of its cache locations long before Athlon, even if Athlon's 20-way cache is smaller. A discussion of cache designs is here:
pcguide.com
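
Here's a toy model of the thrashing (one set, LRU replacement; illustrative only, not an actual P4 or Athlon cache simulator): touch 10 hot lines that happen to share the same index bits, over and over. With 8 ways the set thrashes on every pass; with 20 ways everything fits after the first pass.

/* Toy model of a single cache set with LRU replacement. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define MAX_WAYS 32

static unsigned misses_for(unsigned ways, unsigned hot_lines, unsigned passes)
{
    uint32_t set[MAX_WAYS];                  /* tags in this set, LRU first */
    unsigned filled = 0, misses = 0;

    for (unsigned p = 0; p < passes; p++) {
        for (uint32_t tag = 0; tag < hot_lines; tag++) {
            unsigned hit = 0;
            for (unsigned w = 0; w < filled; w++) {
                if (set[w] == tag) {         /* hit: move tag to MRU end */
                    memmove(&set[w], &set[w + 1], (filled - w - 1) * sizeof set[0]);
                    set[filled - 1] = tag;
                    hit = 1;
                    break;
                }
            }
            if (!hit) {                      /* miss: evict LRU if the set is full */
                misses++;
                if (filled == ways) {
                    memmove(&set[0], &set[1], (ways - 1) * sizeof set[0]);
                    filled--;
                }
                set[filled++] = tag;
            }
        }
    }
    return misses;
}

int main(void)
{
    /* 10 hot lines sharing the same index bits, 100 passes over them */
    printf("8-way : %u misses\n", misses_for(8, 10, 100));   /* 1000 - thrashes     */
    printf("20-way: %u misses\n", misses_for(20, 10, 100));  /* 10 - cold misses only */
    return 0;
}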

The bottom line is that Athlon's and P4's caches make different trade-offs: Athlon's is more complicated and can't respond as quickly (more cycles to read from the cache), but it is also less likely to throw away needed data, forcing re-reads of main memory. P4, OTOH, has a simpler cache that responds quickly (fewer cycles to read from cache) but has to go to main memory more often. This reduces the load Athlon places on its main memory. As chip speeds move higher relative to memory speeds, Athlon performance should scale better, since it goes to main memory less often.

One of the ways P4 makes up for this is with an aggressive prefetch strategy - as the program runs, a sort of shotgun approach sends memory read requests for a number of locations in memory that might or might not soon be needed by the chip. But, since P4 is reading a number of locations that will not be used, it ties up memory bandwidth, never using many of the prefetch reads it performs. Athlon also does some prefetch, but less than P4.
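
Hardware prefetch is the chip making these guesses on its own, but the same idea can be spelled out in software with GCC's __builtin_prefetch hint (just an analogy for the concept, not what the P4 actually executes):

/* Software analogy for speculative prefetch: start memory reads early
   for data we *might* need.  Any prefetched line we never touch is
   wasted bus traffic - the same trade-off the P4's hardware
   prefetcher makes. */
#include <stddef.h>

long sum_some_elements(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i += 4) {
        /* Guess: we'll want data[i + 64] soon (maybe we won't).
           A prefetch hint is harmless if the guess is wrong - it
           never faults - but the bus traffic is spent either way. */
        __builtin_prefetch(&data[i + 64], 0 /* read */, 1 /* low reuse */);
        total += data[i];
    }
    return total;
}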

P4 also reads locations 128 bytes at a time, while Athlon reads 64 bytes at a time. That means that when any memory location is read, so are the next 127 for P4, and 63 for Athlon - and room will often have to be made in the cache for all those locations by flushing existing data. Very often, the next 3 bytes are needed (most instructions and much data are at least 4 bytes long), and often a "chunk" of data 16 bytes long is needed. But it becomes less and less likely that the following locations are needed as the length of the read is increased. So, this is another area in which P4 uses its memory bus to read locations that aren't later used.
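
The waste per scattered small read is easy to quantify (line sizes from above; the 4-byte figure is just an example):

/* Bus traffic for one scattered 4-byte read. */
#include <stdio.h>

int main(void)
{
    int used = 4;                            /* bytes the program really wanted */
    int p4_line = 128, athlon_line = 64;     /* bytes pulled in per line fill   */

    printf("P4    : %3d bytes fetched, %3d of them unused\n", p4_line, p4_line - used);
    printf("Athlon: %3d bytes fetched, %3d of them unused\n", athlon_line, athlon_line - used);
    return 0;
}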

So, it becomes easy to see why P4 does a lot more reading of its memory for a given level of performance than does Athlon, and why P4 is more likely to benefit from an increase in its FSB than Athlon. Basically, P4 becomes bandwidth bound much sooner than Athlon.

P4's designers weren't dumb: its long-pipeline design suffers more from the small delays that can be associated with reading from the cache than does Athlon's short-pipeline design - so P4's cache makes sense for its design, and Athlon's cache makes sense for its design.

But, the bottom line is that, for a given memory technology, Athlon is capable of considerably higher total performance. To see how much higher, look at your example in which P4 is already "memory bound" with dual-channel memory and a 400 MHz FSB, while Athlon has "room to grow its performance" through scaling of the CPU clock even with only a single memory channel and a 266 MHz FSB.