Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: mas_ who wrote (253030), 6/8/2008 1:07:46 PM
From: dougSF30
 
Mas: Anything that fits entirely in L1, like some handwritten code/function, will definitely be worse.

Wrong. Hint: "other core improvements"

Mas: Anything that fits entirely in Penryn's 6MB L2 (15 cycles) will also be worse, as 5.75MB of the equivalent Nehalem cache will be at 39 cycles.

Wrong again. Do you really need an example?

How about a 6MB block of data that is repeatedly processed by a core IN ORDER? You don't think the prefetchers hide the latency almost completely? Given Nehalem's core improvements, this example would have a shot at being faster than Penryn (clock for clock) even with both of Nehalem's L2 and L3 disabled completely: after the core waits for the first few fetches from main memory, the prefetchers kick in and have the remaining 6MB of data arriving exactly when it's needed.
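
A rough sketch in C of the pattern I'm talking about (the 6MB size and pass count are just illustrative, and real timing obviously depends on the compiler and machine; the point is the unit-stride reads that a stream prefetcher is built to cover):

/* Toy example: repeatedly walk a ~6MB block IN ORDER.
   The sequential, unit-stride reads are exactly what hardware
   prefetchers detect, so the lines arrive before the core asks
   for them even when every access misses L2/L3. */
#include <stdlib.h>
#include <stdio.h>

#define WORKING_SET (6 * 1024 * 1024)   /* ~6MB block, as in the example */

int main(void)
{
    size_t n = WORKING_SET / sizeof(long);
    long *data = malloc(n * sizeof(long));
    long sum = 0;

    if (!data) return 1;
    for (size_t i = 0; i < n; i++)
        data[i] = (long)i;

    /* Process the whole block in order, many times over. */
    for (int pass = 0; pass < 100; pass++)
        for (size_t i = 0; i < n; i++)
            sum += data[i];

    printf("%ld\n", sum);   /* keep the compiler from dropping the loops */
    free(data);
    return 0;
}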

This should not be a difficult concept.

Now think of a typical real-world example, with some, but not total, regularity of access. The extent of "localization" and regularity of access will determine whether Nehalem's much faster L2 plus faster core outperforms Penryn's much larger, slower L2 and (supposedly) faster L1.

And that is before we throw in Nehalem's main-memory latency and bandwidth advantages.

Take the list example, but suppose it works like this:

The code randomly accesses a 256KB block within a 6MB working set for 100000 iterations, then moves to a different 256KB block, etc.

In this case, the L2 is the bottleneck, and the Nehalem cache system blows the doors off Penryn's, despite the overall 6MB working set size.

Now make the 100000 iterations 10000. Then 1000, then 100. Do you get it?
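
For concreteness, here is roughly what that example looks like as a toy C loop (the block size, working-set size, iteration count, and the cheap LCG standing in for "random" are all just illustrative):

/* Toy example: random accesses confined to one 256KB block for ITERS
   touches, then move on to the next 256KB block of a 6MB working set.
   With a large ITERS the hot block sits in a small, fast L2; shrink
   ITERS and the per-block reuse drops, shifting more of the traffic to
   whichever cache level holds the full 6MB. */
#include <stdlib.h>
#include <stdio.h>

#define WORKING_SET (6u * 1024u * 1024u)
#define BLOCK       (256u * 1024u)
#define ITERS       100000              /* try 100000, then 10000, 1000, 100 */

int main(void)
{
    size_t n_blocks  = WORKING_SET / BLOCK;
    size_t per_block = BLOCK / sizeof(long);
    long *data = malloc(WORKING_SET);
    long sum = 0;
    unsigned seed = 1;

    if (!data) return 1;
    for (size_t i = 0; i < WORKING_SET / sizeof(long); i++)
        data[i] = (long)i;

    for (size_t b = 0; b < n_blocks; b++) {
        long *block = data + b * per_block;
        for (int i = 0; i < ITERS; i++) {
            seed = seed * 1103515245u + 12345u;   /* cheap LCG */
            sum += block[seed % per_block];       /* random hit inside the 256KB block */
        }
    }

    printf("%ld\n", sum);
    free(data);
    return 0;
}

Turn ITERS down and watch how little reuse each block gets before you move on; that is exactly the knob that decides which cache layout the access pattern favors.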



To: mas_ who wrote (253030), 6/8/2008 9:50:28 PM
From: graphicsguru
 
Mas: Anything that fits entirely in Penryn's 6MB L2 (15 cycles) will also be worse.

That's a very strong claim. Why don't you tell me your assumptions about the hit rates of the various caches? If the L2 has a high enough hit rate, its faster speed in Nehalem can more than compensate for the lost cycles on data that has to be fetched from L3. Pick realistic hit rates and do the math before you make this kind of claim.
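
For illustration only (these numbers are assumptions, not measurements): take Nehalem's 256KB L2 at roughly 10 cycles, its L3 at the 39 cycles you quoted, and Penryn's 6MB L2 at 15 cycles. If 90% of the accesses that miss L1 hit Nehalem's L2 and 10% fall through to L3:

Nehalem: 0.90 x 10 + 0.10 x 39 = 12.9 cycles average
Penryn:  1.00 x 15             = 15.0 cycles average

With those numbers Nehalem's average latency is already lower despite the 39-cycle L3, and the L2 hit rate has to fall below roughly (39 - 15) / (39 - 10), about 83%, before Penryn's flat 15 cycles wins. That is the kind of math you need to show.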

Of course, one can create code that will run slower on Nehalem than Penryn. But that's not the question. Can you point to any example of real code doing something useful that you're confident will run slower per clock? That would be a substantive prediction.

You've been asked for this, and haven't replied.