To: dougSF30 who wrote (253082) 6/9/2008 2:46:36 PM
From: pgerassi

Doug: You don't understand that most access patterns fall within a general range and are well understood. Given a normal, broadly based access pattern, the working set size, the execution pipeline length, the branch pipeline length, the cache parameters for each level (line length, associativity, number of lines, latency, and type: exclusive, inclusive, or hybrid), and the memory access latency, you can come fairly close to the true performance that testing will reveal.

In your example, the execution and branch pipeline lengths are the same for every set of cache parameters. L1 access is the biggest single determinant of performance: one cycle there can equal 5-10 cycles of difference in L2, because on normal code the L1 satisfies the memory request 85-95% of the time. So if Nehalem has the same execution and branch pipeline lengths as Penryn, the extra cycle of L1 latency will not be mitigated by the lower L2 latency, or even by the lower memory access latency, on normal code. The only thing that will help is much higher clock speeds. The required clock ratio is roughly the Nehalem/Penryn ratio of (80% x (execution pipeline + L1 latency) + 20% x (branch pipeline + L1 latency)).

For Nehalem to have sufficiently higher clocks, its pipeline has to be longer than Penryn's, and its balance has changed so that it needs that extra cycle to be well balanced. In fact, that is one of the reasons Intel gives for lengthening the L1 access latency. Another reason would be to add associativity at the L1 level. Increased associativity hurts single-thread latency, but gives much higher performance when multiple threads share the cache. Given hyperthreading and server-type usage, it might be worth the extra cycle to go from 2-way set associative to 4- or 8-way set associative.
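To put rough numbers on the L1-versus-L2 tradeoff (the latencies below are illustrative round numbers I am assuming, not actual Penryn or Nehalem figures), here is a simple average-memory-access-time sketch:

```python
# Illustrative sketch, not a real simulator: average memory access time
# (AMAT) for a two-level hierarchy, showing why an extra L1 cycle is hard
# to buy back with a faster L2 when the L1 hit rate is 85-95%.
# All latency values are assumed for illustration only.

def amat(l1_hit_rate, l1_cycles, l2_cycles):
    """Cycles per load: L1 latency on a hit, L2 latency on an L1 miss."""
    return l1_hit_rate * l1_cycles + (1.0 - l1_hit_rate) * l2_cycles

# "Penryn-like": 3-cycle L1, slower L2 (assumed 15 cycles).
# "Nehalem-like": 4-cycle L1, faster L2 (assumed 10 cycles).
for hit in (0.85, 0.90, 0.95):
    a = amat(hit, 3, 15)
    b = amat(hit, 4, 10)
    print(f"L1 hit {hit:.0%}: 3c L1 / 15c L2 -> {a:.2f}  vs  4c L1 / 10c L2 -> {b:.2f}")
```

Even with the L2 assumed a full 5 cycles faster, the 4-cycle-L1 configuration loses at all three hit rates. The break-even L2 saving per extra L1 cycle is hit/(1-hit): about 5.7 cycles at an 85% hit rate, 9 at 90%, which is where the 5-10 cycle equivalence above comes from.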
Since the primary thread gives way to the secondary thread when it has to wait for something like an L3 or memory read to finish, the secondary thread will likely need to fetch something itself, and if that fetch is satisfied from L1, all the better. With 2 ways, it is likely that both are in play for the primary thread, but with 4 or more, one may still hold the secondary thread's data, so it can get a few dozen cycles of work done before the primary grabs back control.

Your contention that it is impossible to grade cache configurations given just their parameters is wrong. All that is needed is that all else stays the same, and both mas and many others are making that assumption. Nehalem may just have little tweaks that improve execution. Intel may think that the ODMC (on-die memory controller) and memory bandwidth overshadow any other compromises made to make it work. The benchmarks let out so far tend to show where the choices made give a good advantage in performance per clock; we will have to wait for the ones that show where the choices are poor.

Pete
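The two-thread set-conflict scenario above can be sketched with a toy LRU cache set (the tags are invented labels for illustration, not a model of any real L1):

```python
# Toy sketch of a single cache set under LRU replacement, assuming two
# hardware threads share the L1 as described above. With 2 ways the
# primary thread's pair of lines evicts the secondary thread's line;
# with 4 ways the secondary's line survives and it hits on resumption.

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []          # least recently used first

    def access(self, tag):
        """Return True on a hit; update LRU order; evict LRU on a miss."""
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)    # evict the least recently used line
        self.lines.append(tag)
        return False

for ways in (2, 4):
    s = CacheSet(ways)
    s.access("secondary")        # secondary thread's line is brought in
    s.access("primary-A")        # primary thread then touches two lines
    s.access("primary-B")
    hit = s.access("secondary")  # does the secondary's line survive?
    print(f"{ways}-way: secondary thread {'hits' if hit else 'misses'}")
```

With 2 ways the secondary thread misses and has to refetch; with 4 ways it hits, which is the "few dozen cycles of useful work" opportunity described above.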