To: dougSF30 who wrote (253082) 6/9/2008 2:46:36 PM
From: pgerassi

Doug: You don't understand that most access patterns fall within a general range and are well understood. Given a normal, broadly based access pattern, the working set size, the execution pipeline length, the branch pipeline length, the cache parameters for each level (line length, associativity, number of lines, latency, and type: exclusive, inclusive, or hybrid), and the memory access latency, you can come fairly close to the true performance that testing will reveal.

In your example, the execution and branch pipeline lengths are the same for every set of cache parameters. L1 access is the biggest single determinant of performance: one cycle there can equal 5-10 cycles of difference in L2, because on normal code the L1 satisfies the memory request 85-95% of the time. So if Nehalem has the same execution and branch pipeline lengths as Penryn, the extra cycle of L1 latency will not be mitigated by the lower L2 latency, or even by the lower memory access latency, on normal code. The only thing that will help is much higher clock speeds. The required clock ratio is roughly the Nehalem/Penryn ratio of (80% x (execution pipeline + L1 latency) + 20% x (branch pipeline + L1 latency)).

For Nehalem to have sufficiently higher clocks, its pipeline has to be longer than Penryn's, and its balance has changed so that it needs that extra cycle to be well balanced. In fact, that is one of the reasons Intel gives for lengthening the L1 access latency. Another reason would be to add associativity at the L1 level. Increased associativity hurts single-thread latency, but gives much higher performance when multiple threads share the cache. Given hyperthreading and server-type usage, it might be worth the extra cycle to go from 2-way set associative to 4- or 8-way set associative.
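To put rough numbers on the L1-versus-L2 tradeoff (the latencies below are illustrative round numbers I am assuming, not actual Penryn or Nehalem figures), here is a simple average-memory-access-time sketch:

```python
# Illustrative sketch, not a real simulator: average memory access time
# (AMAT) for a two-level hierarchy, showing why an extra L1 cycle is hard
# to buy back with a faster L2 when the L1 hit rate is 85-95%.
# All latency values are assumed for illustration only.

def amat(l1_hit_rate, l1_cycles, l2_cycles):
    """Cycles per load: L1 latency on a hit, L2 latency on an L1 miss."""
    return l1_hit_rate * l1_cycles + (1.0 - l1_hit_rate) * l2_cycles

# "Penryn-like": 3-cycle L1, slower L2 (assumed 15 cycles).
# "Nehalem-like": 4-cycle L1, faster L2 (assumed 10 cycles).
for hit in (0.85, 0.90, 0.95):
    a = amat(hit, 3, 15)
    b = amat(hit, 4, 10)
    print(f"L1 hit {hit:.0%}: 3c L1 / 15c L2 -> {a:.2f}  vs  4c L1 / 10c L2 -> {b:.2f}")
```

Even with the L2 assumed a full 5 cycles faster, the 4-cycle-L1 configuration loses at all three hit rates. The break-even L2 saving per extra L1 cycle is hit/(1-hit): about 5.7 cycles at an 85% hit rate, 9 at 90%, which is where the 5-10 cycle equivalence above comes from.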
Since the primary thread gives way to the secondary thread when it has to wait for something like an L3 or memory read to finish, the secondary thread will likely need to fetch something itself, and if that fetch is satisfied from L1, all the better. With 2 ways, it is likely that both are in play for the primary thread, but with 4 or more, one may still hold the secondary thread's data, so it can get a few dozen cycles of work done before the primary grabs back control.

Your contention that it is impossible to grade cache configurations given just their parameters is wrong. All that is needed is that all else stays the same, and both mas and many others are making that assumption. Nehalem may just have little tweaks that improve execution. Intel may think that the ODMC (on-die memory controller) and memory bandwidth overshadow any other compromises made to make it work. The benchmarks let out so far tend to show where the choices made give a good advantage in performance per clock; we will have to wait for the ones that show where the choices are poor.

Pete
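The two-thread set-conflict scenario above can be sketched with a toy LRU cache set (the tags are invented labels for illustration, not a model of any real L1):

```python
# Toy sketch of a single cache set under LRU replacement, assuming two
# hardware threads share the L1 as described above. With 2 ways the
# primary thread's pair of lines evicts the secondary thread's line;
# with 4 ways the secondary's line survives and it hits on resumption.

class CacheSet:
    def __init__(self, ways):
        self.ways = ways
        self.lines = []          # least recently used first

    def access(self, tag):
        """Return True on a hit; update LRU order; evict LRU on a miss."""
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)
            return True
        if len(self.lines) == self.ways:
            self.lines.pop(0)    # evict the least recently used line
        self.lines.append(tag)
        return False

for ways in (2, 4):
    s = CacheSet(ways)
    s.access("secondary")        # secondary thread's line is brought in
    s.access("primary-A")        # primary thread then touches two lines
    s.access("primary-B")
    hit = s.access("secondary")  # does the secondary's line survive?
    print(f"{ways}-way: secondary thread {'hits' if hit else 'misses'}")
```

With 2 ways the secondary thread misses and has to refetch; with 4 ways it hits, which is the "few dozen cycles of useful work" opportunity described above.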