To: Scumbria who wrote (114685) 6/6/2000 8:27:00 PM From: pgerassi
Dear Scumbria:

It needs to compare 16 tags of 18 bits each. Figure one cycle to get started, one cycle to get the set (8 bits), 4 cycles to compare the tags at 72 bits (4 tags * 18 bits) per cycle, 1 cycle to save the L1 victim, and 1 cycle to transfer the first critical word, for a total of 8 cycles after an L1 miss. Then you need 7 more cycles to finish uploading the cache line to L1 and 8 more cycles to download the victim line to L2. That is a total of 23 cycles. It appears that the first six cycles of that can be done while the download is in progress, leading to an overall latency of 20 cycles (the 3 cycles of the L1 miss are included).

Now, if the L2 is not asked to do another lookup while the upload and download are in progress, the next lookup will have only an 11-cycle latency (3 for L1 and 8 for L2 critical word first, CWF). What the program counts as latency is really about 9 cycles of wait time for the transfer bus to be freed.

Yes, the lookup could be done in one cycle if all 288 bits (16 * 18) of the selected set could be compared at once (there are 256 sets in the cache). The transfer subsystem could separate the upload and download busses, leading to an 8-cycle improvement in wait time. The bus could also be widened to 512 bits so that upload and download take 2 cycles instead of 16. All of these together would result in an L2 latency of about 2 cycles (critical word first is not needed when all words are available at once, and the download can be done simultaneously with the lookup). The same could be done for L1, resulting in an L1 latency of 2 cycles.

This, however, would require large changes in the core, something AMD was not ready to risk at this time. These changes could be in place when the core is redesigned for Mustang.

Pete
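For anyone who wants to check the arithmetic above, here is a rough Python sketch of the cycle budget. The constant and function names are made up for illustration, and the 6-cycle overlap is modeled simply as a subtraction from the total, which is one reading of the estimate in the post and not a description of the actual hardware.

```python
# Back-of-the-envelope model of the cycle counts quoted in the post.
# All names here are hypothetical; the numbers are the post's estimates.

TAGS_PER_SET = 16            # tags compared per lookup
TAG_BITS = 18                # bits per tag
COMPARE_BITS_PER_CYCLE = 72  # 4 tags * 18 bits compared each cycle
L1_MISS_CYCLES = 3           # cycles to detect the L1 miss
LINE_TRANSFER_CYCLES = 8     # bus transfers needed to move one cache line


def compare_cycles(width=COMPARE_BITS_PER_CYCLE):
    """Cycles needed to compare every tag in the set at the given width."""
    total_bits = TAGS_PER_SET * TAG_BITS
    return -(-total_bits // width)  # ceiling division


def l2_cwf_cycles():
    """Start + set select + tag compare + save L1 victim + first critical word."""
    return 1 + 1 + compare_cycles() + 1 + 1


def total_transfer_cycles():
    """CWF lookup, rest of the upload to L1, then the victim download to L2."""
    return l2_cwf_cycles() + (LINE_TRANSFER_CYCLES - 1) + LINE_TRANSFER_CYCLES


def busy_bus_latency(overlap=6):
    """Latency seen when the bus is still busy (overlap is an assumption)."""
    return L1_MISS_CYCLES + total_transfer_cycles() - overlap


def idle_bus_latency():
    """Latency seen when no upload or download is in progress."""
    return L1_MISS_CYCLES + l2_cwf_cycles()


if __name__ == "__main__":
    print("tag compare cycles:", compare_cycles())
    print("L2 CWF cycles:", l2_cwf_cycles())
    print("total cycles:", total_transfer_cycles())
    print("busy-bus latency:", busy_bus_latency())
    print("idle-bus latency:", idle_bus_latency())
    print("bus wait time:", busy_bus_latency() - idle_bus_latency())
```

Calling `compare_cycles(width=288)` shows the single-cycle lookup case, where all 16 tags are compared at once.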