Technology Stocks: Intel Corporation (INTC)


To: wanna_bmw who wrote (162713), 3/21/2002 2:51:32 PM
From: Tenchusatsu
 
WBMW, AMD can go with higher-associativity caches because they typically run their processors at lower clock speeds than Intel; a more associative lookup takes longer, and a slower clock leaves more time for it within a pipeline stage. In other words, it's easier for them to increase the associativity, at least when you look at the problem from this angle alone.

Don't worry about what the 'Droids will claim. When they see favorable results for AMD's processors, they'll point to some well-marketed feature and claim, "That's the reason!" It doesn't matter whether that feature actually has any effect or not, because the truth takes a back seat to their need to believe fantasies.

By the way, if you want to battle armchair experts, it might help to get minor facts straight. An 8 KB fully associative cache will not have 8192 ways, unless you think the size of each cacheline is one byte. ;-)
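
To put numbers on that (a quick sketch; the 64-byte line size is my assumption, since none is stated here, but it's typical):

    def ways_if_fully_associative(cache_bytes, line_bytes):
        # In a fully associative cache, any line can occupy any slot,
        # so the way-ness equals the number of lines, not the byte size.
        return cache_bytes // line_bytes

    print(ways_if_fully_associative(8 * 1024, 64))  # 128 ways, not 8192
    print(ways_if_fully_associative(8 * 1024, 1))   # 8192 only with 1-byte lines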

Tenchusatsu



To: wanna_bmw who wrote (162713), 3/21/2002 2:55:54 PM
From: dale_laroy
 
>Few people here that claim to understand this stuff really do, so I would take most comments, even those from Dale LaRoy, with a grain of salt.<

Obviously, you are among those that fail to understand this stuff. For example:

>"Full Associativity" means that the "way-ness" of a cache is equal to its size. Thus, an 8KB fully associative cache will have 8192-ways.<

Full associativity means that the way-ness of a cache is equal to the number of entries, which depends upon the cache line size. If the caches being modeled here do indeed use a one-byte cache line size, which is what would be needed for 8192 ways in an 8 KB cache, these tables are useless.

As it is, these tables show no indication that they factor in the effect of hardware prefetch, which would dramatically change the hit rates for a 512 KB L2 cache and a 1 MB L2 cache with both caches having the same number of entries.
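
To illustrate how much even a crude prefetcher can move the numbers, here is a toy model of my own (assuming a next-line prefetcher, a fully associative LRU cache of 512 64-byte lines, and a sequential access stream; real prefetchers are far more selective):

    from collections import OrderedDict

    def hit_ratio(addresses, num_lines, line_bytes=64, prefetch_next=False):
        # Toy fully associative LRU cache; counts only demand hits/misses.
        cache = OrderedDict()  # line number -> None, kept in LRU order

        def touch(line):
            if line in cache:
                cache.move_to_end(line)
                return True
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used
            return False

        hits = total = 0
        for addr in addresses:
            hits += touch(addr // line_bytes)
            total += 1
            if prefetch_next:
                touch(addr // line_bytes + 1)  # may help, may pollute
        return hits / total

    stream = range(0, 1 << 20, 8)  # sequential 8-byte loads over 1 MB
    print(hit_ratio(stream, num_lines=512))                      # ~0.875
    print(hit_ratio(stream, num_lines=512, prefetch_next=True))  # ~1.0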

Also note that according to these tables, it isn't just a matter of 8-way being as good as fully associative. These tables actually indicate a higher hit rate with 8-way set associativity than with full associativity. Without some explanation of this discrepancy, I would tend to reject this data outright.
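
For what it's worth, that outcome is possible under LRU rather than being an outright contradiction: a cyclic working set slightly larger than the cache thrashes a fully associative LRU cache, while set partitioning confines the thrashing to a single set. A toy demonstration (my own construction, not the methodology behind those tables):

    from collections import OrderedDict

    def misses(trace, total_lines, ways):
        # LRU cache of total_lines lines, split into sets of `ways` lines;
        # ways == total_lines gives full associativity.
        num_sets = total_lines // ways
        sets = [OrderedDict() for _ in range(num_sets)]
        count = 0
        for line in trace:
            s = sets[line % num_sets]
            if line in s:
                s.move_to_end(line)
            else:
                count += 1
                s[line] = None
                if len(s) > ways:
                    s.popitem(last=False)
        return count

    # Cycle ten times through 65 distinct lines in a 64-line cache:
    trace = [line for _ in range(10) for line in range(65)]
    print(misses(trace, 64, ways=64))  # 650: fully associative LRU thrashes
    print(misses(trace, 64, ways=8))   # 146: only the one 9-line set thrashes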



To: wanna_bmw who wrote (162713), 3/21/2002 3:27:15 PM
From: Dan3
 
From your post:

The miss ratios were calculated from data collected by functional, user-mode simulations of optimized benchmarks. As a result, the cache miss ratios reported above may not be representative of a real platform. A few sources of error are discussed below.

First, only primary misses were counted by the simulator. Once a reference missed in the cache, the data was loaded and all subsequent accesses to the line hit. A modern processor may also experience secondary misses, or references to data that has yet to be loaded from a prior cache miss. There is a nonzero miss latency, and a real processor may execute other instructions while waiting for the data. The sequential model used in functional simulations is optimistic in this respect.
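
A minimal sketch of that primary/secondary distinction (my own toy model, assuming one access per cycle and an infinite cache, just to isolate the counting issue):

    def count_misses(lines, miss_latency=100):
        # A functional simulator reports only the primary miss; a real
        # machine also stalls on secondary references to the same
        # in-flight line.
        arrival = {}  # line -> cycle at which its data arrives
        primary = secondary = 0
        for cycle, line in enumerate(lines):
            if line not in arrival:
                primary += 1
                arrival[line] = cycle + miss_latency
            elif cycle < arrival[line]:
                secondary += 1  # requested again before the data came back
        return primary, secondary

    # Eight back-to-back loads to the same line:
    print(count_misses([0] * 8))  # (1, 7): one counted miss, seven stalls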

Second, a modern processor will have optimizations that affect cache performance. Hardware prefetching of instructions and data can have the positive effect of reducing the number of cache misses. However, prefetching can also cause cache pollution. Further, speculative execution can result in increased memory traffic for speculatively issued loads, and I-cache pollution from incorrect branch predictions. This also makes the results optimistic.


Totally invalidating this analysis is the fact that all other processes, including all the operating system processes, were ignored!

Third, the operating system was ignored. System calls cause additional cache misses to bring in OS code and data, and in doing so they replace cache lines from the user program. This increases the number of conflict and capacity misses for the user program in a real system. Since the additional misses from OS intervention were not modeled, our results are optimistic. One possibility is to flush the caches on system calls. However, this is the other extreme, and would have made it impossible to measure the compulsory miss rates.
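
To see why flushing conflates the miss types, here is a toy illustration (my own, assuming an infinite cache so that every miss would otherwise be compulsory):

    def misses_with_flush(trace, flush_at):
        # Infinite cache flushed at the given cycles; after a flush,
        # re-fetches of previously seen lines miss again, and without
        # separate bookkeeping they look exactly like compulsory misses.
        seen_before, cache = set(), set()
        compulsory = refetch = 0
        for cycle, line in enumerate(trace):
            if cycle in flush_at:
                cache.clear()
            if line not in cache:
                if line in seen_before:
                    refetch += 1      # not compulsory, but looks like it
                else:
                    compulsory += 1
                cache.add(line)
            seen_before.add(line)
        return compulsory, refetch

    print(misses_with_flush([0, 1, 0, 1], flush_at={2}))  # (2, 2)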

Fourth, all prefetch instructions (loads to R31) were treated as normal references. All were executed, and references from prefetch instructions were included in the overall statistics. Although prefetch instructions may prevent (or reduce the impact of) cache misses from instructions in the original code, the misses still occur (just sooner). However, prefetch instructions increase the overall hit ratio because the subsequent loads and stores that hit in the cache add to the overall hit count. One possibility is to ignore prefetch instructions altogether (the Alpha ISA allows this). Another possibility is to count the misses from the prefetches, but not count them as instructions.
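
A minimal sketch of those three accounting choices (my own, assuming a simple fully associative LRU model; not the study's simulator):

    from collections import OrderedDict

    def hit_ratio(trace, num_lines, policy):
        # trace: (line, is_prefetch) pairs.
        #   'normal':    prefetches executed, counted as ordinary references
        #   'ignore':    prefetch instructions dropped entirely
        #   'miss_only': prefetches executed; their misses counted, hits not
        cache = OrderedDict()
        hits = misses = 0
        for line, is_pf in trace:
            if is_pf and policy == "ignore":
                continue
            hit = line in cache
            if hit:
                cache.move_to_end(line)
            else:
                cache[line] = None
                if len(cache) > num_lines:
                    cache.popitem(last=False)
            if not is_pf or policy == "normal":
                hits += hit
                misses += not hit
            elif policy == "miss_only":
                misses += not hit
        return hits / (hits + misses)

    # Two prefetches of a line (a miss, then a hit), then eight loads:
    trace = [(0, True), (0, True)] + [(0, False)] * 8
    for policy in ("normal", "ignore", "miss_only"):
        print(policy, hit_ratio(trace, num_lines=512, policy=policy))
    # normal 0.9, ignore 0.875, miss_only 0.889: the policy alone moves
    # the reported hit ratio.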

Fifth, the benchmarks were optimized for an Alpha 21264 processor. The binaries may have been tuned to perform well with the 21264 cache hierarchy (64K 2-way L1 caches). Ideally, the binary should not favor a particular cache configuration. Further, the binary contains no-ops for alignment and steering of dependent operations in the clustered microarchitecture of the 21264. These no-ops increase the overall instruction count for the functional simulation.