To: Scumbria who wrote (107373) 4/24/2000 9:39:00 AM
From: pgerassi
Dear Scumbria:

I was thinking of on-die L2 (combined). One cell, consisting of one bit of storage (currently 6 transistors), one NAND gate (3 transistors), and one NOR gate (3 transistors), is used in a regular array of 208 by 128 cells (26 x 1024). A standard 64K-bit SRAM array is just 256 x 256 cells plus row-address and column decodes plus control overhead. Thus, this is not hard.

Splitting each input address line into 8 lines uses 8 inverters (3 transistors and 2 delays). Each column is separated into 16 regions with 16 inverters each (2 more transistor delays), and each region drives 8 cells, or 16 inputs (2 for the 2 gates in each cell), for a total of 6 transistor delays (some of these could probably be saved). Each row of bits has a 52-input NOR gate below (just a line of transistors), for 56 more transistors. One output feeds up to 10 diodes for the successful address; the other output feeds an open-collector line to show success. At the bottom, the eight 10-bit addresses are combined to get a 10-bit output address and a one-bit success flag.

The LRU portion consists of another 42 storage cells plus 40 cells of LRU-algorithm address decode. In addition, the 512 cells of cache data can be added. Each line can be extended to cover about 313 double-row cells. Thus the final size of a section is 256 cells high by 313 cells long. There will be 8 sections, say 4 high by 2 wide, plus some smaller, more irregular sections for LRU and L2 cache control.

Since the input has 6 transistor delays, the comparator cell has 2 delays, and the output adds 5 more, the total is about 13 transistor delays. If the cache data is written to the top of the array and read from the bottom, the line delays should be roughly the same. Since one clock period probably consists of more than 13 transistor delays, this logic should take no more than one clock period. And since the cache data is co-located with the associative and LRU logic, the timing is no different from the current timing for regular SRAM. Normal SRAM uses a fixed address mask instead of the address-storage flip-flops, so your objections from a design standpoint have been dealt with for a long time. SRAM already needs at least an 8-input AND (or NOR) gate to decode each row; the cell simply replaces the mask for each bit.
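Here is a minimal behavioral sketch, in Python rather than transistors, of the tag-match path described above. It assumes the 26 x 1024 figure means a stored 26-bit tag for each of 1024 ways; the class and function names are mine and purely illustrative, not the actual circuit. Each line stores its own tag in place of a normal SRAM's fixed address mask, each stored bit is compared against the incoming address bit, a row-wide NOR collapses any mismatch, and the result is a 10-bit match address plus a one-bit success flag.

TAG_BITS = 26      # tag bits stored and compared per line (assumed)
NUM_LINES = 1024   # 1024-way array, so the match address is 10 bits
TAG_MASK = (1 << TAG_BITS) - 1

class TagMatchArray:
    def __init__(self):
        # Per-line stored tag and valid bit: the storage cells that stand in
        # for a normal SRAM's fixed address-decode mask.
        self.tags = [0] * NUM_LINES
        self.valid = [False] * NUM_LINES

    def fill(self, line, tag):
        """Latch a new tag into a line (the line the LRU logic would pick)."""
        self.tags[line] = tag & TAG_MASK
        self.valid[line] = True

    def match(self, address):
        """Return (hit, line): the one-bit success flag and the 10-bit
        address of the matching line (0 when there is no match)."""
        tag = address & TAG_MASK
        for line in range(NUM_LINES):
            # Per-bit compare collapsed by the row-wide NOR: any mismatched
            # bit kills this row's match line.
            if self.valid[line] and (self.tags[line] ^ tag) == 0:
                return True, line
        return False, 0

# Example: latch a tag into line 37, then look it up.
cam = TagMatchArray()
cam.fill(37, 0x12ABCD)
print(cam.match(0x12ABCD))   # (True, 37)
print(cam.match(0x000001))   # (False, 0)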
IBM uses data-cache logic to increase the performance of the AIX disk caches, so one resource for cache designers is the set of algorithms used for disk caching.

The real reason for FACs (in the above case, a 1024-way cache) is that most application programs are built from libraries of separately compiled functions. When they are linked, you get a scattering of many compact, localized regions within the executable. A typical executable, especially a GUI one, is about 10M bytes, so a 1-way cache would need to be 10M bytes in size. There are probably 16 to 40 functions of about 2K to 10K each (across all libraries) where most of the time is spent. These are usually in groups: one for the GUI, one for the back end, one for the GUI library, and one for drivers and kernel. This application would fit in a 4-way cache of 128K to 512K (depending on the summed area of those functions). This assumes that the hash is very effective; when it is not, or when the local groups are more spread out, more ways would be needed.

And that is the single-user case. In a server or multi-tasking environment, 10 or 20 applications would be running, each with a different footprint. In that case 100 or more functions are in use, spread across 20 to 40 regions, and only a cache with 100 to 200 ways can hold them. Since most functions are about a few hundred bytes in size (much of it error-handling code), each function probably uses 5 to 10 cache lines. Thus a 64K-byte LRU FAC with 64-byte lines would be able to hold the working set, where a 512K-byte 4-way cache would thrash a lot.

For high-end apps, workstations, and servers, an LRU FAC is most needed. For games and simple number crunching, a 4- or 8-way (even direct-mapped) cache would suffice.

Pete
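Below is a toy Python simulation of the working-set argument above: it replays a set of scattered hot functions against a 64K-byte fully associative LRU cache and a 512K-byte 4-way set-associative cache. All parameters (100 functions of 512 bytes, 64-byte lines, the replay count) are assumptions chosen only to make the mechanism visible, not measurements. In particular, the functions are placed on 128K boundaries so they all alias to the same sets of the 4-way cache, the "hash is not very effective" case mentioned above.

from collections import OrderedDict

LINE = 64  # bytes per cache line, as in the post

class LRUCache:
    """Set-associative LRU cache; ways equal to the total line count
    makes it fully associative."""
    def __init__(self, size_bytes, ways):
        self.ways = ways
        self.num_sets = size_bytes // LINE // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        line_addr = addr // LINE
        s = self.sets[line_addr % self.num_sets]
        if line_addr in s:
            s.move_to_end(line_addr)      # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)     # evict the least recently used line
            s[line_addr] = None

# 100 hot functions of 512 bytes (8 lines) each, placed on 128K boundaries
# so that they all compete for the same few sets of the 4-way cache.
functions = [i * 128 * 1024 for i in range(100)]

fac = LRUCache(64 * 1024, ways=64 * 1024 // LINE)   # 64K fully associative LRU
sa4 = LRUCache(512 * 1024, ways=4)                  # 512K 4-way set associative

for _ in range(100):                                # replay the working set
    for base in functions:
        for off in range(0, 512, LINE):             # touch all 8 lines
            fac.access(base + off)
            sa4.access(base + off)

for name, c in (("64K fully associative LRU", fac),
                ("512K 4-way set associative", sa4)):
    print("%s: hit rate %.3f" % (name, c.hits / (c.hits + c.misses)))

With this deliberately aligned layout, the 800-line working set fits in the 1024-way FAC after the first pass, while 100 lines keep fighting over 4 ways per set in the larger cache; with a well-spread layout the set-associative cache does much better, which is the "effective hash" case above.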