Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Ali Chen who wrote (227273) 3/2/2007 7:26:11 PM
From: pgerassi
 
Dear Ali:

Better take a hard look at the graphs. Even using base scores, which are already widely felt to be poor, rather than the better peak scores, the Opteron has lower task times than any P4, and at lower clocks. You might have missed that the X axis is clock period and not the more typical clock frequency, so clock speed falls rather than rises as you go right. Furthermore, he didn't want to include the Power 5 scores, as that trend line would have intersected the Y axis below 0, which would disprove his methods. Given that the A64 FX has an unlocked multiplier, he could have run a trend for that one, but that likely would also have shown his methods to be wrong. Also note that the Opterons used registered memory, while the P4s used either unregistered RAMBUS DRAM or DDR. Even with the extra cycles that registered memory costs over unregistered, the Opteron trend line was below most of the P4s.
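For anyone following along, the trend-line reasoning can be sketched roughly as below. This is only an illustration under an assumed model, task time = slope * clock period + intercept, where the intercept is the time that would remain at an infinitely fast core clock (the part that does not scale with the core, such as waiting on memory). The function name and all sample numbers are made up and are not read off the chart. Under that model, a fitted intercept below 0 would mean negative task time at infinite clock, which is why such a result would count against the method.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical samples (NOT from the chart): one CPU family at four clocks.
clocks_ghz = [1.8, 2.0, 2.2, 2.4]

# X axis is clock PERIOD in ns, so faster parts sit further LEFT.
periods_ns = [1.0 / f for f in clocks_ghz]

# Synthetic task times generated from the assumed model:
# 180 units that scale with the clock period, plus 25 that do not.
task_times = [180.0 * p + 25.0 for p in periods_ns]

slope, intercept = fit_line(periods_ns, task_times)

# The intercept is the extrapolated task time at period 0 (infinite clock).
# A negative value here would be physically meaningless.
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

With real, noisy scores the fit would not be exact, and the intercept is an extrapolation well outside the measured range, so it should be read with caution.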

Also, if he had used Sun's Studio compilers and peak scores, the Opterons would have killed the P4s and even the Itaniums hands down, as shown by the current SPECfp_2000 scores.

Pete



To: Ali Chen who wrote (227273) 3/4/2007 12:49:09 PM
From: DDB_WO
 
Ali - The attempt to reduce main memory latency is indeed a noble move. However, as I said many times, the advantage of AMD's approach to memory handling is highly exaggerated. You need to look at the overall effect of the whole memory subsystem, including the art of hardware prefetch, the quality of software prefetch, cache miss rates, and FSB/memory penalties.

Well, even the first Opterons had a measurable advantage in memory latency compared to K7. And this was likely one of the main reasons that these CPUs could still compete, even with much lower L1/L2 bandwidth, smaller L2 size, not so good prefetchers (the first update came with rev. E, as announced by McGrath in his Stanford talk), an inefficient SSE implementation etc. So I think it's not correct to say that the IMC had no positive effect at all. But it's OK to assume, like you did, that its effect is not the maximum of what could have been achieved.

It's always the same: companies (esp. the smaller ones) have limited R&D resources and have to decide which cards to play. One company chose to first improve RAM (RDRAM), NB, FSB, prefetching and L2 cache size, while the other only went with RAM (DDR) and NB (integrating it into the CPU) first, while also increasing the L2 cache size. It's not a guessing game for them, since hardware changes faster than the huge amount of software out there.

If you look even at somewhat old data,
home.austin.rr.com
the bottom line is that AMD already has a 50-60% disadvantage in overall memory waste traffic, even compared to the old Pentium D with an older chipset and slower memory; compare line (4) with lines (6) or (7). It translates into a 25-30% loss in overall performance. The data are for SPEC2000, which has smaller data sets than the newer SPEC2006, so the gap must be more pronounced in 2006. But even if the AMD statement is true (100ns vs. 70/55ns), which I doubt (their base might be quite obsolete), their latency effort is in the right direction.


An interesting chart. It clearly shows that K8 would need to increase its clock faster to keep up with the single-core Prescott with 2 MB L2. At ~6 GHz they would have about the same SPECfp_base2k performance. There is no Pentium D in the chart (typo?). This 25-30% loss would, however, not happen in reality as long as we are that far away from THz frequencies.

However, the steepness and position of the 2M Prescott line vs. that of the (what I assume is the) 1M Prescott [3] line show some effect of the L2 alone. The FX is the fastest x86 CPU there, but maybe also thanks to unregistered DDR RAM. However, with 2 samples it's already unreliable to extrapolate this way. With one sample it's obviously impossible.
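To make the two-sample point concrete, here is a rough sketch of how a crossover clock falls out of two trend lines, and how much it moves when one of only two samples is slightly off. It assumes the same straight-line model as the chart (task time vs. clock period). The helper names and every number are hypothetical; they were picked so the lines cross near 6 GHz purely for illustration and are not measurements.

```python
def line_through(p1, p2):
    """Line (slope, intercept) through two (period_ns, task_time) samples."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def crossover_ghz(line_a, line_b):
    """Clock in GHz where two trend lines meet, or None if they never do."""
    (sa, ia), (sb, ib) = line_a, line_b
    if sa == sb:
        return None  # parallel lines never cross
    period_ns = (ib - ia) / (sa - sb)
    if period_ns <= 0:
        return None  # crossing at zero or negative period is not a real clock
    return 1.0 / period_ns


# Hypothetical CPU A: less time per clock period, larger clock-independent part.
cpu_a = line_through((0.5, 70.0), (0.4, 60.0))

# Hypothetical CPU B: more time per clock period, smaller clock-independent part.
cpu_b = line_through((0.5, 90.0), (0.4, 74.0))

# Same CPU B, but with its second sample measured 1% too high.
cpu_b_noisy = line_through((0.5, 90.0), (0.4, 74.0 * 1.01))

base = crossover_ghz(cpu_a, cpu_b)
noisy = crossover_ghz(cpu_a, cpu_b_noisy)
print(f"crossover: {base:.2f} GHz, with 1% error in one sample: {noisy:.2f} GHz")
```

In this made-up case a 1% error in a single sample moves the crossover from 6 GHz to roughly 8.3 GHz, which is the kind of sensitivity that makes a two-point extrapolation shaky.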

I'm just wondering about this 55ns number, since latencies of current K8s are already that low. IIRC, the P4 had ~100+ ns latencies in RightMark Memory Analyzer results, while Dothan and Yonah are already in the ~50-70ns region for random accesses. There are also higher numbers for K8 and Yonah, depending on the size of the arrays. Having to close and open DRAM pages really adds a lot to latency.

But in Barcelona a direct prefetch to L1 would save 16 or so cycles vs. K8, and the improved prefetching itself plus the IMC prefetcher, optimized page conflict handling and other improvements in the memory subsystem will have effects which are rather difficult to predict. You could browse AMD's patents in this regard, but they also don't tell every detail.

Regards,
Matt