To: Rink who wrote (238785)
8/15/2007 3:24:08 AM
From: pgerassi

Dear Rink:

Before L3 is consulted, two cache levels have already been checked for the address. It has already taken 9 CPU cycles to find out that it is not in the L1 or L2 caches local to the core. ZRAM has an access time of about 1ns or so. For a 2GHz core that is only 2 cycles to SRAM's one cycle. At 4GHz, it's 4:1.

Now the L2 uses 8 cycles to check whether an address is within the cache, at 2 ways a cycle. If you check 8 ways in a cycle, the check takes only two access cycles. Similarly for K10's L3, which is to be 32-way set associative: if you check all 32 ways at the same time, it takes only one ZRAM access, or 4 CPU cycles at 4GHz.

As to moving data in and out, you can use a multiplexer to push the 576 bits (64 bytes plus ECC) through quickly to the SRAM L1 and L2 in an access cycle. So even L3 needs only an additional 3 access cycles (one to see if it's in L3, one to move the resulting cache line to L1, and one to move the evicted line back into L3), or just 12 more CPU cycles at 4GHz. Parallel to that is the lookup in the other cores' L1 and L2 caches. So in a total of 21 cycles at 4GHz, you know that the data requested is not within the cores or L3.

One of the reasons the SRAM caches look up addresses only a few ways per cycle is that high-speed SRAM uses quite a bit of power. Because ZRAM is slower, the logic checking the ZRAM uses far less power. Since we see that CPUs consume 5-10 times the power to go twice as fast, logic that runs at 20% of the speed should take only about 1/50th the power. Sixteen times that logic, enough to check 32 ways at once instead of 2, still saves more than two-thirds of the power (16/50 = 32%) yet performs the check 3.2 times faster (16 x 20%).

The big savings of ZRAM come in the die area used for a given size of cache: the bigger the cache, the more advantageous ZRAM is over SRAM. This has two reasons. One is that as the cache gets bigger, the wireline delays start to predominate over the inherent access time of the memory.
Two is that slower, lower-power CPUs see less of a gap between SRAM and ZRAM. Where the total above is 21 cycles in the 4GHz case, at 2GHz it is 15 cycles. At 1GHz, SRAM holds no access-time advantage over ZRAM, while ZRAM has lower wireline delays.

How big would a Shanghai with a 30MB SRAM L3 cache be at 45nm? About 610mm². A Shanghai with a 30MB ZRAM L3 cache would be the same size as one with a 6MB SRAM L3 cache, about 270mm².

Which would you rather have: a 6MB L3 K10 QC, or a 30MB L3 K10 QC with 3 more cycles of L3 latency, at the same price? Most would opt for the latter for server and HPC work.

Pete
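
For anyone who wants to check the cycle counts in the post, here is a rough Python sketch of the arithmetic. The 9-cycle L1/L2 miss, the roughly 1ns ZRAM access, the 3 access steps, and the 16x / 20% / 1/50th power figures all come from the post. The function names, and the choice to round each access up to whole CPU cycles, are assumptions made for illustration only, not a model of any real K10 or Shanghai part.

```python
import math

# Figures quoted in the post; treat them as the poster's estimates.
L1_L2_MISS_CYCLES = 9   # cycles to learn the line is not in the core's L1 or L2
ZRAM_ACCESS_NS = 1.0    # "about 1ns or so"
L3_ACCESS_STEPS = 3     # L3 lookup, move line to L1, write evicted line back to L3


def access_cycles(access_ns: float, clock_ghz: float) -> int:
    """CPU cycles spanned by one memory access (assumption: round up to whole cycles)."""
    return math.ceil(access_ns * clock_ghz)


def l3_miss_cycles(clock_ghz: float) -> int:
    """Cycles until a request is known to be in neither L1, L2 nor a ZRAM L3."""
    zram_cycles = access_cycles(ZRAM_ACCESS_NS, clock_ghz)
    return L1_L2_MISS_CYCLES + L3_ACCESS_STEPS * zram_cycles


def relative_power_and_speed(parallel_factor: int,
                             speed_fraction: float,
                             power_fraction: float) -> tuple[float, float]:
    """Replicate slow logic parallel_factor times.

    Returns (power, lookup throughput), both relative to the fast logic.
    """
    return parallel_factor * power_fraction, parallel_factor * speed_fraction


if __name__ == "__main__":
    for ghz in (4.0, 2.0, 1.0):
        print(f"{ghz:.0f} GHz: one ZRAM access = "
              f"{access_cycles(ZRAM_ACCESS_NS, ghz)} cycles, "
              f"miss known after {l3_miss_cycles(ghz)} cycles")
    power, speed = relative_power_and_speed(16, 0.20, 1 / 50)
    print(f"16x parallel check: {power:.0%} of the power, {speed:.1f}x the speed")
```

With these inputs the totals are 21 cycles at 4GHz and 15 cycles at 2GHz, matching the post. The same formula gives 12 cycles at 1GHz, a figure the post does not state but which follows from ZRAM needing a single cycle at that clock.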