Politics : Formerly About Advanced Micro Devices


To: Scumbria who wrote (114685) 6/6/2000 8:27:00 PM
From: pgerassi
 
Dear Scumbria:

It needs to compare 16 tags of 18 bits each. Figure one cycle to get started, one cycle to get the set (8 bits), 4 cycles to compare the tags at 72 bits (4 * 18) per cycle, 1 cycle to save the L1 victim, and 1 cycle to transfer the first critical word, for a total of 8 cycles after an L1 miss. Then you need 7 more cycles to finish uploading the cache line to L1 and 8 more cycles to download the victim line to L2. That is a total of 23 cycles. It appears that you can do the first six cycles of that while the download is in progress, leading to an overall latency of 20 cycles (the 3 cycles of L1 miss are included). Now, if the L2 is not asked to do another lookup while the upload and download are in progress, the next lookup will have only an 11 cycle latency (3 for L1 and 8 for L2 CWF). What the program counts as latency is really about 9 cycles of wait time for the transfer bus to be freed.
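For anyone who wants to check the arithmetic, here is a rough Python tally of the cycle counts above. The constant and function names are made up for illustration, and the way the six overlapped cycles are subtracted is just one reading of the estimate, not a model of the actual hardware.

```python
# Rough tally of the cycle estimate above. All figures come from the post;
# the names are hypothetical and exist only for this sketch.
L1_MISS = 3                  # cycles spent detecting the L1 miss
START = 1                    # cycle to get started
SET_SELECT = 1               # cycle to get the set (8-bit index)
TAG_WAYS = 16                # tags compared per lookup
TAG_BITS = 18                # bits per tag
COMPARE_BITS_PER_CYCLE = 72  # 4 tags of 18 bits per cycle
SAVE_VICTIM = 1              # cycle to save the L1 victim
CRITICAL_WORD = 1            # cycle to transfer the first critical word
REST_OF_UPLOAD = 7           # remaining cycles to upload the line to L1
VICTIM_DOWNLOAD = 8          # cycles to download the victim line to L2
OVERLAPPED = 6               # cycles assumed hidden under the download


def compare_cycles():
    """Cycles needed to compare all tag bits (ceiling division)."""
    total_bits = TAG_WAYS * TAG_BITS  # 288 bits
    return -(-total_bits // COMPARE_BITS_PER_CYCLE)


def critical_word_latency():
    """L2 cycles from the L1 miss to the first critical word."""
    return START + SET_SELECT + compare_cycles() + SAVE_VICTIM + CRITICAL_WORD


def full_transaction():
    """L2 cycles including the rest of the upload and the victim download."""
    return critical_word_latency() + REST_OF_UPLOAD + VICTIM_DOWNLOAD


def busy_bus_latency():
    """Overall latency when the transfer bus is still busy."""
    return full_transaction() - OVERLAPPED + L1_MISS


def idle_bus_latency():
    """Latency when no transfer is in progress (L1 miss plus L2 CWF)."""
    return L1_MISS + critical_word_latency()


def bus_wait():
    """Cycles spent waiting for the transfer bus to be freed."""
    return busy_bus_latency() - idle_bus_latency()


if __name__ == "__main__":
    print("L2 critical word latency:", critical_word_latency())
    print("full transaction:", full_transaction())
    print("busy-bus latency:", busy_bus_latency())
    print("idle-bus latency:", idle_bus_latency())
    print("bus wait:", bus_wait())
```

The difference between the busy-bus and idle-bus figures is the wait time that a latency benchmark would end up counting.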

Yes, the lookup could be done in one cycle if all 288 bits can be compared at once in the right set (256 sets in the cache). The transfer subsystem could separate the upload and download buses, leading to an 8 cycle improvement in wait time. Also, the bus could be widened to 512 bits so that upload and download could be done in 2 cycles instead of 16. All of these would result in an L2 latency of about 2 cycles (critical word first is not needed when all words are available at once, and the download can be done simultaneously with the lookup). This could also be done for L1, resulting in an L1 latency of 2 cycles. This, however, would require large changes in the core, something AMD was not ready to risk at this time. These changes could be in place when the core is redesigned for Mustang.
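The bus-width numbers can be checked the same way. This sketch assumes a 512-bit (64-byte) cache line and a 64-bit current transfer bus; both are inferred from the 8-cycle transfers above rather than stated outright, and the names are again hypothetical.

```python
# Sketch of the transfer-cycle arithmetic for the wider bus.
# ASSUMPTIONS: 512-bit line and 64-bit current bus, inferred from the
# 8-cycle upload and download figures; not taken from a datasheet.
LINE_BITS = 512
CURRENT_BUS_BITS = 64
WIDE_BUS_BITS = 512
TAG_WAYS = 16
TAG_BITS = 18


def transfer_cycles(line_bits, bus_bits):
    """Cycles to move one cache line across a bus of the given width."""
    return -(-line_bits // bus_bits)  # ceiling division


def upload_plus_download(bus_bits):
    """Cycles for one line upload plus one victim download on a shared bus."""
    return 2 * transfer_cycles(LINE_BITS, bus_bits)


def single_cycle_compare_width():
    """Comparator width needed to check every tag in a set in one cycle."""
    return TAG_WAYS * TAG_BITS


if __name__ == "__main__":
    print("current bus:", upload_plus_download(CURRENT_BUS_BITS), "cycles")
    print("512-bit bus:", upload_plus_download(WIDE_BUS_BITS), "cycles")
    print("one-cycle compare width:", single_cycle_compare_width(), "bits")
```

With separate upload and download buses, the 8 download cycles would no longer hold up the next upload, which is where the 8 cycle improvement in wait time comes from.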

Pete



To: Scumbria who wrote (114685) 6/6/2000 9:54:00 PM
From: Elmer
 
Re: "8 cycle latency (after 3 clocks of L1 miss) is very long. This is much longer than Coppermine's L2 latency. I wonder what is up? It has a serious impact on performance"

Obviously a superior design....

EP