Dan, Re: "do you know anything at all about how a cache works?"
Maybe the better question is, "Do you know how cache works?" Or maybe it's, "Do you know how to use Math?" Associativity has to do with cache eviction, and how long data gets to remain in the cache. The simplest kind of cache is Direct Mapped, where every memory address maps to exactly one cacheline slot in the cache.
As an example that I will refer to often, let's say we have 128KB of cache, which can hold at most 1,024 cachelines of 128B each, as long as every one of them has a unique alignment. Thus, if the cache index comes from bits 7-16 of the system address pins, then the data at memory addresses 0x1000_0000 through 0x1001_FF80 can all sit in the cache at the same time. If you read sequentially, you can get every single one of those cachelines owned by the cache, but as soon as you read from memory address 0x1002_0000, the first cacheline gets evicted. On the other hand, if you were to read at multiples of 0x1_0000 starting at address 0x1000_0000, you would get two cachelines in cache, and then the third read, to memory address 0x1002_0000, would still evict the first cacheline.
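If you want to see that indexing in black and white, here's a quick sketch I threw together (Python, using the exact 128KB / 128B-line / bits 7-16 parameters from the example above; the helper name dm_index is just something I made up for illustration):

```python
# Direct Mapped example from above: 128KB of cache, 128B cachelines,
# index taken from address bits 7-16 (bits 0-6 are the offset within a line).
LINE_SIZE = 128           # bytes per cacheline
NUM_LINES = 1024          # 128KB / 128B

def dm_index(addr):
    """Which of the 1,024 cacheline slots a given address maps to."""
    return (addr >> 7) & (NUM_LINES - 1)

# Sequential reads from 0x1000_0000 up through 0x1001_FF80 land on 1,024 unique indexes,
# so they can all sit in the cache at once.
assert len({dm_index(0x1000_0000 + i * LINE_SIZE) for i in range(NUM_LINES)}) == NUM_LINES

# But 0x1002_0000 wraps around to the same index as 0x1000_0000 and evicts it.
assert dm_index(0x1002_0000) == dm_index(0x1000_0000)

# Reading at multiples of 0x1_0000 only ever touches two indexes,
# and the third read still evicts the first cacheline.
print(dm_index(0x1000_0000), dm_index(0x1001_0000), dm_index(0x1002_0000))   # 0 512 0
```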
Adding associativity means that multiple cachelines with the same alignment can reside in the cache simultaneously. A cache with 4-way associativity can read from memory addresses 0x1000_0000, 0x1002_0000, 0x1004_0000, and 0x1006_0000 and still have all four fit in the cache at once. However, even with the larger associativity, the cache still only holds 1,024 cachelines. Thus, if you read sequentially from 0x1000_0000 through 0x1002_0000 on top of that, you would still be evicting another cacheline. Period.
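Here's the same toy model with associativity bolted on (a tiny LRU set-associative simulator I sketched up in Python; it's a model, not how any real chip's replacement policy necessarily works), showing those four reads staying resident in a 4-way cache while a Direct Mapped cache of the same size thrashes:

```python
from collections import OrderedDict

LINE = 128                 # bytes per cacheline
CACHE = 128 * 1024         # 128KB total, same toy cache as above

class SetAssocCache:
    """Tiny set-associative cache model with LRU replacement in each set."""
    def __init__(self, size, line, ways):
        self.line = line
        self.ways = ways
        self.sets = size // (line * ways)
        self.data = [OrderedDict() for _ in range(self.sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is filled either way)."""
        block = addr // self.line
        s = self.data[block % self.sets]
        tag = block // self.sets
        if tag in s:
            s.move_to_end(tag)        # refresh its LRU position
            return True
        if len(s) == self.ways:
            s.popitem(last=False)     # evict the least recently used line
        s[tag] = True
        return False

addrs = [0x1000_0000, 0x1002_0000, 0x1004_0000, 0x1006_0000]

four_way = SetAssocCache(CACHE, LINE, ways=4)
direct   = SetAssocCache(CACHE, LINE, ways=1)
for a in addrs:                      # first pass: cold misses in both caches
    four_way.access(a)
    direct.access(a)

print([four_way.access(a) for a in addrs])   # [True, True, True, True]  - all four still resident
print([direct.access(a)   for a in addrs])   # [False, False, False, False] - each read evicted the last
```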
The idea behind higher-associativity caches is that applications do not always read sequentially from memory. Much of the time, data sits in arrays that straddle alignment boundaries. With a Direct Mapped cache like the one in the example above, you will often hit cases where only one of those pieces of data can live in the cache at a time. If a program has arrays of data that are each 128KB in size, such as the texture maps for a brand new game, then every read lands on a 0x2_0000 boundary, and every read evicts the previous line from cache.
But then again, few applications actually read perfectly along boundaries, either. There is a balance. Because larger associativities add overhead to the cache lookup, designers try not to make their caches' associativities too large. On the other hand, larger associativities give lower eviction rates on aligned accesses, especially in applications that read data along regular boundaries. So there is a sweet spot in cache design. Evidently the Itanium designers felt the L3 cache was so large that a high associativity would hurt cache lookup times, and that its sheer size was enough to justify a smaller associativity.
But now let's look at a few things that you are clearly confused about.
First, you say that the Athlon cache is effectively 325% larger than the Itanium cache. Wrong! Associativity does not affect the size of the cache, only the eviction rate, and that depends on whether cache accesses fall along a boundary or are sequential. Random accesses should be considered practically the same as sequential as far as caches are concerned.
Second, you say that this is true for a lot of code. Wrong! Most code does not have large gaps in its memory accesses. Code that does need large gaps, such as a video game loading textures, deliberately pads its data structures so that the accesses don't all land on the same boundary. Associativity will improve performance on the average program simply because of the principles of locality that I won't get into, but performance only improves significantly as long as the cache lookup times aren't hurt by the larger associativity.
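To show what I mean about games fiddling with their data structure sizes, here's a sketch (same toy 128KB Direct Mapped cache as before; the array layout is made up for illustration) of how padding the stride by one cacheline breaks up the conflicts:

```python
LINE = 128
NUM_LINES = 1024           # the 128KB Direct Mapped cache from earlier

def dm_index(addr):
    return (addr >> 7) & (NUM_LINES - 1)

def first_line_indexes(base, stride, count):
    """Cache index hit by the first element of each of `count` arrays laid out `stride` apart."""
    return [dm_index(base + i * stride) for i in range(count)]

# Eight 128KB arrays packed back to back: element 0 of every array fights over index 0.
print(first_line_indexes(0x1000_0000, 128 * 1024, 8))          # [0, 0, 0, 0, 0, 0, 0, 0]

# Pad each array by one extra cacheline and the conflicts disappear.
print(first_line_indexes(0x1000_0000, 128 * 1024 + LINE, 8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```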
Third, even assuming that some magic application causes data to be aligned to every 128KB boundary exactly, such that every read from memory on an Itanium and Athlon system is a multiple of 0x2_0000, the Athlon would be able to fit exactly 16 cachelines of data in cache before older ones start getting evicted. Each of its 16 ways is only 256KB / 16 = 16KB deep, so reads spaced 128KB apart all land on the same index, and the only thing saving you is the 16 ways themselves. The Itanium, meanwhile, would be able to fit 32 cachelines of data, because each of its 4 ways is 4MB / 4 = 1MB deep, which spans 8 of those 128KB boundaries, and 8 * 4 = 32. Now, if you were to go further, and find me a program that only accesses data on a 4MB boundary, I'd admit that the Athlon would fit 4x the number of cachelines of data in the cache as compared with the Itanium (16 versus 4), but I'd also have to ask you where the hell you found such a program.
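And if you don't trust my arithmetic, here's the back-of-the-envelope calculation as a sketch (using the cache sizes and associativities stated above; it assumes the stride is a multiple of the line size, which it is here):

```python
from math import gcd

def lines_that_fit(cache_size, ways, stride):
    """How many cachelines read at a fixed stride can coexist before evictions start.
    Assumes the stride is a multiple of the line size."""
    way_size = cache_size // ways                        # bytes covered by one way
    distinct_sets = way_size // gcd(way_size, stride)    # sets actually touched by the stride
    return distinct_sets * ways

KB, MB = 1024, 1024 * 1024
print(lines_that_fit(256 * KB, 16, 128 * KB))   # Athlon L2, 128KB stride -> 16
print(lines_that_fit(4 * MB,  4, 128 * KB))     # Itanium L3, 128KB stride -> 32
print(lines_that_fit(256 * KB, 16, 4 * MB))     # Athlon L2, 4MB stride   -> 16
print(lines_that_fit(4 * MB,  4, 4 * MB))       # Itanium L3, 4MB stride  -> 4
```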
Fourth, you try to claim performance benefits for the Athlon cache. You're wrong on this count, too, because Anand has shown in testing that the Athlon has an exceptional amount of latency in its L2 cache, and that is no doubt due to the overhead involved with the larger associativity. Go ahead and look up his 1.8GHz Pentium 4 review. I believe that's where I saw it. Of course, the large size of the Itanium cache probably has its own overhead, but the idea is that server applications need a higher hit rate, and the 4MB of Itanium L3 cache will have a higher hit rate than the 256KB of Athlon L2 cache any way you look at it.
Fifth, you think that Intel can lose server business to the Athlon because of these cache issues. Wrong again. Like I was just saying, high-end server applications spend their time on data retrieval. There are a lot of random accesses, and the more cache, the better. AMD ought to put more cache on their CPUs if they want to compete in this segment, but alas, they cancelled their Mustang project, and why do you suppose that is? Because AMD is the one having problems with cache manufacturing.
Which brings me to my final point. You claimed that Intel cannot keep up with cache design because they can't design caches with high associativity. That is complete and utter horse sh!t! It's easy to create a cache with more associativity. Intel could have made their caches fully associative if they wanted to, but the overhead involved would be tremendous. As it is, loads of testing and research determine the sweet spot for the size of the caches, the associativity, and the bandwidth required for each individual micro-architecture. Intel has engineers working on this who are a hell of a lot smarter than you are. So you can stop pretending that you are master of the P4 micro-architecture, and learn to live with it when someone who knows what they are talking about tells you that you are arguing from your back end.
wanna_bmw