Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: combjelly who wrote (253019), 6/8/2008 12:26:16 AM
From: graphicsguru
 
Comb:
"it is almost certain that Penryn will beat Nehalem on cache-bound, single-threaded, or at least lightly threaded, code at the same frequency."


Wow, that's quite a leap just from the fact that Nehalem supports hyperthreading and scales to more cores.

I'll argue the opposite. Nehalem, as a "tock" processor, is optimized for the current process, unlike Penryn. The current process has a different ratio of p-mos to n-mos performance than the previous process. Certainly, taking that into account should make it possible to design a superior cache hierarchy.

Given that the L2 on Nehalem is *much* faster than Penryn's L2, how can you be "almost certain" that its single-thread performance is worse at the same frequency?
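
For anyone who wants to check latency claims like that themselves, a pointer-chase microbenchmark is the standard tool: each load depends on the previous one, so the time per iteration approximates the latency of whatever cache level the buffer fits in. Here's a minimal sketch in C - the buffer size and iteration count are arbitrary choices on my part, so size the buffer for the cache level you want to probe:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024)          /* 64K pointers = 512 KB on a 64-bit machine */
#define ITERS 100000000UL

int main(void) {
    void **buf = malloc(N * sizeof(void *));
    size_t *idx = malloc(N * sizeof(size_t));
    size_t i;

    /* Build a random cyclic permutation so the hardware prefetchers
       can't guess the next address. */
    for (i = 0; i < N; i++) idx[i] = i;
    srand(1);
    for (i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (i = 0; i < N; i++)
        buf[idx[i]] = &buf[idx[(i + 1) % N]];

    /* Chase the chain: every load depends on the one before it. */
    void **p = &buf[idx[0]];
    clock_t start = clock();
    for (unsigned long n = 0; n < ITERS; n++)
        p = (void **)*p;
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Printing p keeps the compiler from optimizing the loop away. */
    printf("%.2f ns per load (final p = %p)\n",
           secs * 1e9 / ITERS, (void *)p);
    free(buf);
    free(idx);
    return 0;
}

The 512 KB buffer above would sit comfortably in a big Penryn-style L2 but spill out of a much smaller one, so you can watch where each hierarchy's latency cliff actually is instead of guessing.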

Why don't you pick some specific apps that you think will run slower per clock on Nehalem, and we'll see if your prediction turns out to be true?



To: combjelly who wrote (253019), 6/8/2008 4:23:42 PM
From: wbmw
 
Re: "The point that mas is trying to make is that CPU architecture is a game of trade-offs. Not to belabor the obvious, but Penryn isn't Nehalem. Penryn was designed for fewer maximum cores than Nehalem, not to mention HT. As a result, Penryn was pushed towards maximizing performance for a relatively small number of physical cores. Nehalem was optimized for a larger number of virtual, and for that matter physical, cores. The trade-offs are almost certainly not the same. Given that Intel likely did not screw up the Penryn caches, it is almost certain that Penryn will beat Nehalem on cache-bound, single-threaded, or at least lightly threaded, code at the same frequency."

There's no reason the architecture of Nehalem couldn't have been designed to improve both multi-threaded *and* single-threaded performance. I recall seeing in the IDF slides a number of features targeted at single-threaded performance. Real World Tech covers them here:

realworldtech.com

The multithreaded enhancements just happen to have a greater effect on overall performance. If enhancing multithreaded performance requires trade-offs in single-threaded performance - such as longer latencies or smaller capacities in the low-level caches - that performance can still be made up elsewhere.
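
To see the threading side for yourself rather than argue about it, a rough pthreads probe like the sketch below works: run the same serially dependent kernel on more and more threads and watch aggregate throughput. All the parameters here are arbitrary assumptions on my part. On an SMT-capable part, throughput for this kind of low-IPC work typically keeps climbing past the physical core count, which is the whole point of HT:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define MAX_THREADS 8
#define WORK 200000000UL

/* A serially dependent kernel: each multiply-add needs the previous
   result, so a single thread leaves execution resources idle. That
   idle capacity is what a second SMT thread on the same core can use. */
static void *kernel(void *arg) {
    volatile unsigned long long x = 1;
    unsigned long i;
    for (i = 0; i < WORK; i++)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    (void)arg;
    return NULL;
}

int main(void) {
    int n;
    for (n = 1; n <= MAX_THREADS; n *= 2) {
        pthread_t tid[MAX_THREADS];
        struct timespec t0, t1;
        int i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++)
            pthread_create(&tid[i], NULL, kernel, NULL);
        for (i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d thread(s): %.1f M iterations/s aggregate\n",
               n, n * (WORK / 1e6) / secs);
    }
    return 0;
}

On a chip without SMT you'd expect the curve to flatten at the physical core count; on Nehalem the interesting question is how much it keeps giving past that point.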

Focusing on just the caches by themselves ignores the fact that modern processors are complex systems, with strings of dependencies and trade-offs among the various components. For example, in order to improve the characteristics of Unit A, you may have to relax the timing of Unit B. Then, if someone were to argue that Unit B causes a performance loss, they would be ignoring that the changes to Unit B came as a result of improving Unit A, which could yield a net improvement across the relevant workloads.
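
A concrete version of that argument is a back-of-the-envelope average memory access time (AMAT) calculation: one hierarchy loosely shaped like Penryn's (one big, slower L2 in front of memory) against one shaped like Nehalem's (a small, fast per-core L2 backed by a shared L3). Treat every cycle count and hit rate below as an invented illustration, not a measured figure for either chip:

#include <stdio.h>

/* Back-of-the-envelope AMAT model:
   AMAT = L1_hit_time + L1_miss_rate * (next level's AMAT).
   Every number below is an illustrative assumption. */
int main(void) {
    double mem = 200.0;  /* assumed DRAM latency, in cycles */

    /* Two-level hierarchy: one large, slower L2 in front of memory. */
    double big_l2 = 3.0 + 0.05 * (15.0 + 0.02 * mem);

    /* Three-level hierarchy: small fast L2, backed by a shared L3.
       The smaller L2 misses more often, but the L3 catches most of
       those misses long before they reach DRAM. */
    double small_l2_plus_l3 =
        4.0 + 0.05 * (11.0 + 0.10 * (40.0 + 0.05 * mem));

    printf("large L2 only: %.2f cycles/access\n", big_l2);
    printf("small L2 + L3: %.2f cycles/access\n", small_l2_plus_l3);
    return 0;
}

With these made-up numbers, the "Unit B" regression (a smaller L2 that misses more often) costs under a cycle per access on a cache-bound thread. Whether the "Unit A" gains pay that back is a question about whole workloads, not about one component in isolation.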

As for what "relevant workloads" means, you may have a point that it depends on which applications you care about, but in general it's safe to say that mainstream computing has moved away from applications that are strictly cache-bound, or for that matter strictly favorable to any single micro-architectural feature.

I would disagree that it's a world of corner cases; it's more accurate to say that the world is becoming more homogenized, with real performance coming from strength in all areas of the chip. Optimizing for "just bandwidth", or "just multithreading", or "just caches" will result in a machine that does poorly on most workloads. You actually need to make the best trade-offs across all units, such that the micro-architecture is as balanced as possible, and power efficient besides.

I think Nehalem is an example of a finely tuned architecture: it started with a rather high-performance core and improved upon it, addressing its weaknesses and further polishing its strengths. Even on Anand's untuned motherboard with poor memory performance, he still showed a part capable of outperforming the previous generation by very healthy margins across a decent spectrum of workloads.

And even in the one single-threaded benchmark (Cinebench with one thread enabled), it still managed to outperform Penryn by about 3%. So clearly, your initial statement - that Penryn will almost certainly be faster on this kind of workload - has already been shown to be incorrect. Moreover, the performance is likely to improve further on production-worthy systems.

I think the early Nehalem benchmarks demonstrate a clear improvement over the previous generation, though the picture is still incomplete. If Anand is right about memory performance improving with production-worthy boards, we may see significant upside on what are already some very impressive results. To argue against this is premature, and there really isn't enough data for you, or mas, or anyone else to proclaim anything as obvious or certain.

For most end users, I don't think these arguments are going to mean much. All they care about is overall performance, and if most applications scale with the kind of improvements Nehalem has shown, then Intel scored a home run with this micro-architecture. That, IMO, is the bottom line.