To: Scumbria who wrote (107351) 4/22/2000 12:53:00 PM From: pgerassi
Dear Scumbria:

The budget for a single data line is 64 bytes times 8 bits times 6 transistors per bit, or 3072 transistors. A 32-bit address minus 6 offset bits leaves 26 bits to store and compare. Each bit needs 6 transistors for storage, 9 for an XOR, and one as part of a 26-bit AND gate. That is 416 transistors, plus 3 transistors and 10 diodes to drive the address of a successful compare and a control bit, plus 3 transistors to drive all the address bits to all the comparators in parallel. Thus there needs to be about 435 transistors per line of overhead for a fully associative cache of 1024 lines of 64 bytes each.

The LRU aspects can be done with 42 bits of storage: 20 for read, 20 for write, 1 for a data-valid flag, and 1 for a needs-write flag. Four 10-bit ANDs are also needed to implement the algorithm in parallel (the fastest way). Thus the LRU overhead comes to about 300 transistors per data line.

In total, the overhead of a data line is just 735 transistors against 3072 transistors for the actual data storage, or about 24% more transistors. The overall LRU cache control adds a few K transistors to the whole array. Thus the total overhead for a 64K-byte cache is about 740K transistors, and each additional 64K bytes of cache adds 3.1M transistors.

As to the power requirements, how is this any different from an ALU, FPU, or decoder? Since it can be a regular structure, its density will be nearly the same as the data area's. Thus only 25% more die would be used, whereas a 128K cache would use 100% more.

As to increased latency, an LRU update takes three cycles: one to unlink from the old place on the list, one to link to the new place on the list, and one for the associative match. A four-way cache takes one cycle for each way plus a cycle for the update. However, this can be reduced, with additional overhead, to two or three cycles depending on the update strategy.

The crossover in size happens somewhere between an 8-way and a 16-way cache. The crossover in latency also occurs in that range. The crossover in power occurs somewhere from 16-way to 64-way. All this, of course, assumes the same size cache is used. If the 4-way cache is 4 times larger, power and size are higher in the 4-way, but latency is less than an LRU FAC's. However, as speeds increase, size starts to affect latency: line delay consumes a greater percentage of the access time, and the hit rate becomes far more important. This is why, in disk caches, where the latency of a cache miss is over 1K times greater than the time to compare one address, an LRU fully associative cache is almost always used. The ever-higher ratio of CPU clock speed to DRAM access time will force the use of LRU FACs.

Remember, Scumbria, you said that the latency of caches is becoming more important. A 64K LRU FAC has less latency than a 256K 16-way cache. And when the data cache line gets larger, due to the increasing ratio of bytes transferred per access to first-word access time, the LRU and FAC overhead shrinks.

Pete
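
P.S. For anyone who wants to check the arithmetic, here is a quick back-of-the-envelope calculation in C. The constants are just the device counts from the post above; nothing else is assumed (the drivers and diodes are lumped together as devices for a rough total).

    #include <stdio.h>

    int main(void) {
        /* Data storage per line: 64 bytes x 8 bits x 6T SRAM cells */
        int data = 64 * 8 * 6;                /* 3072 */

        /* Tag path: 32-bit address - 6 offset bits = 26 tag bits,
           each needing 6T storage + 9T XOR + 1T of the 26-bit AND */
        int tag = 26 * (6 + 9 + 1);           /* 416 */
        int drive = 3 + 10 + 3;               /* match drivers, diodes, address fanout */

        /* LRU bookkeeping per line: 42 bits of storage plus the
           four 10-bit ANDs, roughly 300 transistors */
        int lru = 300;

        int overhead = tag + drive + lru;     /* ~735 */
        printf("per line: data %d, overhead %d (%.0f%% extra)\n",
               data, overhead, 100.0 * overhead / data);
        printf("64KB (1024 lines): data %.1fM, overhead %.0fK\n",
               1024 * data / 1e6, 1024 * overhead / 1e3);
        return 0;
    }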
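And the LRU update itself is just a doubly linked list move-to-front. Below is a minimal software model of it; the names (struct line, lookup, touch) are mine, purely for illustration. In hardware the loop in lookup collapses into 1024 parallel comparators firing in one cycle, and the pointer writes in touch become the two 10-bit link updates, which is where the unlink cycle and relink cycle come from.

    #include <stddef.h>

    /* Hypothetical line descriptor. The prev/next links stand in for
       the two 10-bit LRU pointers; valid/dirty are the two flag bits. */
    struct line {
        struct line *prev, *next;
        unsigned tag;           /* 26-bit tag (32-bit address - 6 offset bits) */
        int valid, dirty;
    };

    static struct line *mru;       /* list head: most recently used */
    static struct line *lru_tail;  /* list tail: the eviction victim */

    /* Model of the fully associative match: each comparator XORs its
       stored tag against the incoming tag and ANDs the 26 result bits
       down to one match line. All 1024 fire at once in hardware; the
       sequential loop is only a software model of that. */
    struct line *lookup(struct line lines[], int n, unsigned tag) {
        for (int i = 0; i < n; i++)
            if (lines[i].valid && lines[i].tag == tag)
                return &lines[i];
        return NULL;               /* miss */
    }

    /* On a hit, move the line to the head of the list: one step to
       unlink it from its old place, one to relink it at the front --
       the two update cycles described above. */
    void touch(struct line *l) {
        if (l == mru) return;                  /* already newest */
        if (l->prev) l->prev->next = l->next;  /* unlink */
        if (l->next) l->next->prev = l->prev;
        if (l == lru_tail) lru_tail = l->prev;
        l->prev = NULL;                        /* relink at head */
        l->next = mru;
        if (mru) mru->prev = l;
        mru = l;
    }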