Technology Stocks : Advanced Micro Devices - Moderated (AMD)

To: wanna_bmw who wrote (52927)  8/29/2001 1:44:52 PM
From: pgerassi
 
Wanna_bmw:

Cache wayness affects the cache miss rate. A cache that misses every time is worse than no cache at all, since it adds latency with no benefit. A cache that misses one out of every two tries has a hit rate of 50%. On an L2 cache miss, 200+ cycles are required to fill the line on top-of-the-line CPUs. Thus a 50% effective L2 cache would divide the clock by

(( hit rate * cycles per hit + ( 1 - hit rate ) * cycles per miss ) + cycles per try ) / cycles per try

which comes to about 26 using 200 cycles for a miss, 3 for a hit, a 50% hit rate, and 4 cycles per try.

Modern CPUs usually have a 90-95% cache hit rate. Plugging 90% into the formula above, keeping the other numbers the same, gives a divisor of 6.68; 95% gives 4.21, a substantial improvement. For a P4, where there is no L1 data cache, cycles per L2 cache try are much lower than on an Athlon or P3, so the P4 needs a considerably lower L2 miss rate than either of the other two. It has been shown that the P3 has about an 80-90% L1 cache hit rate, and the Tbird and Palomino have about a 90-95% L1 cache hit rate; since the L1 absorbs most accesses, this effectively multiplies the cycles per L2 cache try by 5 to 20 times for those two.
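The divisor formula above can be sketched as follows; the constants (200 cycles per miss, 3 per hit, 4 per try) are the examples from the text, not fixed properties of any particular CPU:

```python
# Effective clock divisor from the formula above. Default constants are the
# post's examples: 3 cycles/hit, 200 cycles/miss, 4 cycles per cache try.
def divisor(hit_rate, hit_cycles=3.0, miss_cycles=200.0, try_cycles=4.0):
    # average cycles per access, plus the try cost, in units of one try
    avg = hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
    return (avg + try_cycles) / try_cycles

for h in (0.50, 0.90, 0.95):
    print(f"{h:.0%} hit rate -> divisor {divisor(h):.2f}")
```

Running it reproduces the divisors quoted in the text: about 26.4 at a 50% hit rate, 6.68 at 90%, and 4.21 at 95%.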

Now, the use case determines how many ways the cache needs so as not to hurt the L2 cache hit rate. In the P4, both instructions and data are fetched through the L2 cache. The ideal case is for the entire memory to be in cache; only a very small hit is taken when the cache shrinks to just above the working set size. Wayness affects the effective working set size of a system. In any system that multitasks, and almost all modern OSes do, the operating system takes at least one way for itself, one way for its drivers, and one way for its data, while the foreground task takes a way for its program and a way for its data. A 2-way cache is fine within a single task because both the P3 and Athlon have separate 2-way caches for instructions and data. On the P4, you hope the L1 and the trace cache eliminate the need for the two instruction ways in L2, so that 8 ways are enough. An 8-way L2 cache can thus allow 3 to 4 tasks to run simultaneously; a 16-way cache raises this to 7 to 8 tasks. This is why a multitasking system runs better on an Athlon than on either a P3 or a P4.

Now, wayness and size affect the geographical boundaries of caches. There are two typical methods of hashing the physical memory range onto cache locations. The first divides the cache size straight into the addressing range you want to cache, and the wayness then divides each memory segment sequentially into one subsegment per way. The second divides the cache size by the wayness and uses that quotient to divide the addressing range. Using the first on an 8-way 256K L2 with 128-byte cache lines over a 4GB addressing range, you get 16,384 regions, each contributing 1024 bytes (a complete way set of 8 cache lines) per region, with the way set determined by address bits 17 to 10. Using the second with the same parameters, you get 131,072 regions of 128 bytes (one cache line) per way, with the way set determined by bits 14 to 7. Any wayness that is not 2^n for some integer n >= 0 requires the second method.
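The second method is conventional set-associative indexing, and its bit arithmetic for the parameters above can be sketched like this:

```python
# Sketch of the second hashing method (standard set-associative indexing)
# for the post's parameters: 256K cache, 8 ways, 128-byte lines.
cache_size = 256 * 1024
ways = 8
line_size = 128

sets = cache_size // (ways * line_size)    # 32K per way / 128B lines = 256 sets
offset_bits = line_size.bit_length() - 1   # 7: bits 6..0 pick the byte in a line
index_bits = sets.bit_length() - 1         # 8: bits 14..7 pick the way set

def set_index(addr):
    return (addr >> offset_bits) & (sets - 1)

# Addresses one way-span (32K) apart land in the same way set:
print(sets, index_bits, set_index(0) == set_index(32 * 1024))  # 256 8 True
```

One design note: because the index bits sit just above the line offset, every 32K stride of physical memory reuses the same 256 sets, which is what sets up the conflict behavior discussed next.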

Now, it would seem that the choice of method for hashing memory into regions holding a single way set has no bearing on cache effectiveness, but you would be wrong. Most modern OSes demand-page memory, both to allocate it to tasks and to swap virtual memory to disk. The latter is rarely done now, memory being cheap. But the former is why the first method is much worse than the second: it neutralizes the wayness, since a single page can chew up all of the available ways, whereas under the second method a page chews up only one way. In the example above, the first method holds all the ways in just one 256K region, and the first few K of that fall within one page (typically 4K on most x86 systems). The second method loses at most one way to any given page. Heaven help you if two allocated pages in your working set happen to land on the same 256K boundary: the hit rate drops to nearly 0%. Very bad!
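The near-0% hit rate is the classic thrashing pattern: once more conflicting blocks than ways cycle through one way set, LRU evicts each block just before it is needed again. A minimal simulation of a single LRU set shows the cliff:

```python
# Sketch: LRU thrashing in one way set. When the number of conflicting
# blocks touched cyclically exceeds the wayness, the hit rate collapses.
from collections import OrderedDict

def hits_in_one_set(trace, ways=8):
    lines = OrderedDict()   # one set, ordered oldest-first (LRU)
    hits = 0
    for block in trace:
        if block in lines:
            hits += 1
            lines.move_to_end(block)       # mark most recently used
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict least recently used
            lines[block] = True
    return hits

# 8 conflicting blocks fit in an 8-way set; a 9th makes every access miss.
print(hits_in_one_set(list(range(8)) * 10))   # 72 hits (8 cold misses)
print(hits_in_one_set(list(range(9)) * 10))   # 0 hits: thrashing
```

The block numbers here are abstract stand-ins for cache lines that hash to the same way set; the point is the step from "everything fits" to "nothing hits."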

There are many more things that go wrong in typical OS allocation methods as far as increasing cache wayness requirements go. A good rule of thumb is that quadrupling cache size halves the miss rate, as long as wayness is not cut. Theoretically, the best a larger cache can do to the miss rate in a fairly typical Gaussian environment is roughly the square root of the ratio of the larger size to the smaller. Wayness has the same square-root effect, and the two factors should be combined as a geometric mean for the overall effect. Thus, against a 16K 2-way L1 cache, a 256K 8-way cache gives about a 2.8x miss-rate reduction for the size and a 2x reduction for the wayness, for a combined reduction of about 2.4x. A 256K 16-way cache averages about a 2.8x reduction. Exclusive caches increase both the effective wayness and the effective size for a small latency penalty: a 64K 2-way L1 plus a 256K 16-way exclusive L2 gets a size-based miss reduction of 2.24x and a wayness reduction of 3.2x (20 ways from the L2's perspective: two 2-way pairs plus 16), averaging about a 2.7x reduction.
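The geometric-mean rule can be sketched directly. Note one assumption: the 2.8x size figure in the text only works out if the 16K 2-way L1 is counted as 32K combined (split instruction and data halves), so that is used below:

```python
# Sketch of the post's rule of thumb: miss-rate reduction is about
# sqrt(size ratio) from size and sqrt(way ratio) from wayness, and the
# overall effect is the geometric mean of the two factors.
import math

def reduction(size_ratio, way_ratio):
    return math.sqrt(math.sqrt(size_ratio) * math.sqrt(way_ratio))

# 256K L2 vs a 16K 2-way L1, counted as 32K combined I+D (assumption):
print(round(reduction(256 / 32, 8 / 2), 1))    # 2.4  (8-way L2)
print(round(reduction(256 / 32, 16 / 2), 1))   # 2.8  (16-way L2)
```

Both values match the averages quoted in the text.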

Thus wayness affects both the boundaries and the average miss-rate reduction. An n-way (fully associative) version of the 256K L2 above would have a wayness reduction of 32x, for a combined reduction of about 8.5x. A 2M 8-way inclusive cache would average about a 3.36x reduction. To get the same 8.5x reduction from an 8-way inclusive cache would require a cache of roughly 80MB; to get the same 3.36x reduction from a fully associative cache, you would need only a 64KB L2. This shows the power of a completely associative cache.
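The 80MB figure can be recovered by inverting the rule of thumb. The 64K 2-way baseline used here is an assumption implied by the arithmetic in the text rather than something stated outright:

```python
# Inverting the geometric-mean rule (sketch; 64K 2-way baseline assumed):
# what cache size must an 8-way inclusive cache reach to hit a target
# miss-rate reduction, given its wayness contribution?
import math

def required_size(target_reduction, way_ratio, base_size):
    # geometric mean: target = sqrt(sqrt(size_ratio) * sqrt(way_ratio))
    size_factor = target_reduction**2 / math.sqrt(way_ratio)
    return base_size * size_factor**2

mb = required_size(8.5, 8 / 2, 64 * 1024) / 2**20
print(f"{mb:.0f} MB")   # roughly 80 MB, close to the post's figure
```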

So ways do matter, and the missing L1 data cache in the P4 does hurt its speed. Given the trace cache, I wonder why they didn't use the 8K L1 for data instead of instructions.

Pete