To: Scumbria who wrote (115650) 6/13/2000 10:20:00 AM From: pgerassi
Dear Scumbria: It is because you do not understand how the cache latency program works. If you look at its input parameters, one of them is the delta, the read increment between data addresses. The delta keeps increasing: in the output I saw, each column's delta was twice that of the column immediately to its left. When the delta reached 64 bytes, the latency increased to the 20 cycles quoted.

Since the algorithm is a very tight loop that can stay in the instruction cache (the L1 I-pipe) at all times, no L2 accesses are needed for instruction fetches. All the activity is in the L2, and because of the pipelining, each data access (one read every CPU cycle) evicts an L1 cache line and loads a new one from the L2. That never gives the L1-to-L2 bus the time it normally has to finish the L1/L2 swap of a cache line. On a K75, the write-back of the L1 victim line to the L2 is unnecessary, because the line is already duplicated in the L2, so it skips the victim write-back. Thus the K75 shows a lower latency on this particular piece of code.

In general code, however, that advantage is optimized away: this program thrashes the L1 on every single access, and it would actually run slower, to a large degree, on all the high-performance platforms. This is exactly the kind of code I would use as a benchmark against a super-pipelined CPU like Willamette. It forces any CPU to let each data instruction run through its entire pipe before the next instruction can be started, so an "unknowing customer" would decide from that benchmark that the super-pipelined processor is a dud. You have always argued that some of your fellow CPU designers call this a severe limitation and almost always set the significance of this problem too high, and that more balanced pipeline stages are better for CPU performance than fewer.
But under this benchmark, it is possible that a low-latency CPU like a K6-3/400 would beat a high-latency CPU like Willamette even at 1.5 GHz, given the same memory speed, say PC133 SDRAM. In almost all applications, however, you would argue that Willamette at 1.5 GHz would "smoke" the K6-3/400.

Since I know that most code in general makes at least ten accesses to the L1 before an L1 miss, the L2 swap will be finished before a new L2 swap is needed. Thus the full-speed nature of the L2 interface will show, along with the benefits of the high associativity and additional size. Therefore TBird will outrun a K75 in almost all cases, and once code is optimized to take advantage of TBird's cache architecture, the K75 will fall further behind. Remember that Duron will run better on such code as well (the benefits of having a high-volume, low-cost baby brother).

Pete