Technology Stocks : Advanced Micro Devices - Moderated (AMD)

To: ajbrenner who wrote (22422)12/13/2000 5:28:14 PM
From: Dan3 of 275872
 
An interesting post from Ace's Hardware

Posted By: Tim Wilkens <wilkens@uiuc.edu>
Date: Saturday, 9 December 2000, at 12:46 a.m.

Hi everyone,

Well, I wanted to post this message to vindicate my predictions on the general forum a bit. There I claimed that SPEC FP 2000 performance is due in large part, extremely large part, to memory bandwidth. There have been people on the technical and general boards who attribute this performance to SSE2, caches and other nifty attributes of the P4. That's great. How could we analyze that? I say compile SPEC using the CVF 6.5 compiler and MSVC++ 6.0 and then compare to Intel's wonderfully dependable and robust compilers that we, the programmers, can't download yet. Let's marvel at this performance we can't hope to get for months. I haven't seen anybody propose doing the above, and it seems to be the most logical and simple way to analyze how important the SSE2 instructions are. I'm getting a bit cynical, but when bus bandwidth only gets about 4 sentences in a review of the P4 ... well, people can understand why I'm a little distressed. Especially when one uses SPEC FP 2000 to analyze a processor's strengths.

Ok.. I'm going to first show some Excel spreadsheet numbers concerning the % increase the P4 got over the P3 upon INCREASING THE BUS SPEED BY AN ADDITIONAL 2X (133 MHz ==> 400 MHz). It's not rocket science.. and I think it's pretty damn convincing. Then I'm going to give the P4 its due and ask some questions that can foster some good healthy discussion.

Here are the numbers for each benchmark...

BENCHMARK / % INCREASE UPON PROVIDING AN ADDITIONAL 2X BANDWIDTH
(bus 133 MHz ==> 400 MHz, computed as (P4# / P3# - 1) * 100)
--------------------------------------------
wupwise / 81%
swim / 152%
mgrid / 135%
applu / 131%
mesa / 7%
galgel / 49%
art / 29%
equake / 159%
facerec / 44%
ammp / 21%
lucas / 133%
fma3d / 44%
sixtrack / 49%
apsi / 12%
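
For anyone who wants to redo the arithmetic, the formula in the table header is just a relative gain. A minimal Python sketch follows; the function name and the sample scores are made up for illustration and are not real SPEC results.

```python
def percent_increase(p4_score, p3_score):
    """Relative gain of the P4 score over the P3 score, in percent."""
    if p3_score <= 0:
        raise ValueError("baseline score must be positive")
    return (p4_score / p3_score - 1.0) * 100.0

# Hypothetical scores, chosen only to show the arithmetic.
print(round(percent_increase(250.0, 100.0)))  # 150
```

A result of 0 means no change, 100 means the score doubled, and 200 means it tripled.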

What do we reasonably expect? Well, I'd say well-written code should expect 20-30% more perf upon going from 133 MHz ==> 266 MHz and then 40-60% upon moving from 266 ==> 400 MHz.

5, YES FOLKS, 5!! benchmarks increased by more than 130%. 200% would represent perfect linear scaling with the bus speed. INCREDIBLE. I thought galgel and art would be the culprits but Jesus, I never expected this kind of coding. If people really depended this heavily upon the bus bandwidth of processors, then where would we be today? This is a joke. I think it's sad that this benchmark even has the word CPU attached to it.
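
The count of five and the 200% ceiling can be checked from the table. The sketch below is only illustrative: the percentages are copied from the post, the variable names are mine, and it takes the post's framing at face value by treating the bus speed as the only thing that changed.

```python
# Percent gains copied from the table in the post.
gains = {
    "wupwise": 81, "swim": 152, "mgrid": 135, "applu": 131, "mesa": 7,
    "galgel": 49, "art": 29, "equake": 159, "facerec": 44, "ammp": 21,
    "lucas": 133, "fma3d": 44, "sixtrack": 49, "apsi": 12,
}

BUS_OLD_MHZ = 133
BUS_NEW_MHZ = 400

# Gain a purely bandwidth-bound code would see with perfect linear scaling.
linear_gain = (BUS_NEW_MHZ / BUS_OLD_MHZ - 1) * 100  # roughly 200%

# Benchmarks that gained more than 130%.
over_130 = sorted(name for name, g in gains.items() if g > 130)
print(len(over_130), over_130)

# Each benchmark's gain as a share of the linear-scaling ceiling.
for name, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} {g:4d}%  {g / linear_gain:5.0%} of linear")
```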

You know I can give a processor its due. I'm not all AMD, only 90% AMD :o).

P4.. hardware prefetch.. this is a godsend for this processor, and Intel should be congratulated on that. Great job. And in many posts on this board I've stated that the 850 chipset is remarkable. The guys who put up with all the Rambus difficulties and constructed this remarkable kick-ass chipset truly did the P4 chip architects a favor. But I think 1/2 of the perf that the P4 gets from it is due to hardware prefetch. I think that, coupled with this awesome chipset, is responsible for the incredible memory bandwidth seen in SPEC FP 2000. How about some analysis of this?

Now what about the 760 chipset? Well, it's really nice and performs well, but it's not reaching the same levels of perf that the 850 is seeing. Is this a fault of the 760 chipset? NOPE. I do not believe this. What I believe is that the proc simply cannot take advantage of that bandwidth. P4 clearly has a better branch predictor and it's got HARDWARE PREFETCH. I believe that HARDWARE prefetch keeps that CPU core fed better and allows better use of the chipset's bandwidth. We will most definitely know soon.. PALOMINO is coming. I'm eagerly anticipating this chip's release. THEN we can say whether the 760 holds a candle to the 850 or not. I don't know now.. but it's fair to say that as of right now the 850 + HARDWARE PREFETCH stomps on the 760 WITHOUT HARDWARE PREFETCH. Will the tables turn?

WHAT ABOUT SPEC FP 2000? Why the poor... incredibly poor memory management? Does this have to do with NOT USING OPTIMIZED BLAS? Yes.. the folks at SPEC in their infinite wisdom have chosen to put the BLAS source in the benchmarks themselves and then renamed the routines. Dgemm isn't dgemm anymore but something else. This was told to me by one of the people at SPEC, on the board of SPEC. Only 2 benchmarks out of the 14 can actually be linked with external optimized BLAS binaries. The others.. well, you are ... really out of luck. What does this do to perf? Here it is:

DGEMM Source Compiled:
w3.physics.uiuc.edu
==> ~100 MFLOPS

DGEMM OPTIMIZED IN ATLAS (www.netlib.org/atlas/archives, IT'S FREE):
w3.physics.uiuc.edu
==> 1400 MFLOPS

One is left awestruck. For matrices larger than the L2 size ATLAS gets 18x more performance on a 1.2 GHz Athlon with DDR. To be fair.. I don't have a P4 and can't compile ATLAS on that .. but I'm trying to convince someone to do this for me. Is anybody game? I used the PIII binary for 256K L2 and 32K L1 for the P4 there. SPEC is using matrices of 1000x1000 or larger in many of these applications. I've read each benchmark's description and it's clear that significant perf could be gained by using these OPTIMIZED ATLAS BLAS routines.
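
To make the compiled-source vs. tuned-library gap a bit more concrete, here is a rough Python sketch of the two loop structures involved: the textbook triple loop, and a cache-blocked (tiled) version of the same product, which is one of the basic ideas tuned BLAS libraries build on. This is not ATLAS's code or the SPEC source. The function names, the tile size of 16 and the matrix size of 48 are all illustrative assumptions. In pure Python the interpreter overhead swamps any cache effect, so don't expect the ratio quoted above from this; it only shows the restructuring and how an MFLOPS figure is derived.

```python
import random
import time

def matmul_naive(a, b, n):
    """Textbook i-j-k triple loop over n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c

def matmul_blocked(a, b, n, bs=16):
    """Same product, computed tile by tile so each tile is reused while hot."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    ci, ai = c[i], a[i]
                    for k in range(kk, min(kk + bs, n)):
                        aik, bk = ai[k], b[k]
                        for j in range(jj, min(jj + bs, n)):
                            ci[j] += aik * bk[j]
    return c

def mflops(n, seconds):
    """A dense n x n multiply costs about 2*n^3 floating-point operations."""
    return 2.0 * n ** 3 / seconds / 1e6

n = 48  # small on purpose; the post talks about 1000x1000 and larger
random.seed(0)
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]

t0 = time.perf_counter()
c1 = matmul_naive(a, b, n)
t1 = time.perf_counter()
c2 = matmul_blocked(a, b, n)
t2 = time.perf_counter()

# Both versions must agree; only the memory access order differs.
max_diff = max(abs(x - y) for r1, r2 in zip(c1, c2) for x, y in zip(r1, r2))
print(f"naive:   {mflops(n, t1 - t0):.2f} MFLOPS")
print(f"blocked: {mflops(n, t2 - t1):.2f} MFLOPS")
print(f"max difference: {max_diff:.2e}")
```

In a compiled language with matrices that spill out of L2, the blocked order touches main memory far less often per floating-point operation, which is where a bandwidth-starved benchmark would feel the difference.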

Now ATLAS is free.. why not package this wonderful program with SPEC and use it? Anyone running anything remotely like SPEC on a machine should know better than to compile BLAS from source. If not, then.. they are either very new or, well .. I won't comment on that. BUT COMPILING THE SPEC BLAS cripples performance, folks, and to ME and others out there like PER, EMIL BRIGS and others .. this SPEC performance doesn't reflect reality. Link to real fast math libraries that are publicly available, this isn't proprietary stuff, and give me the numbers.

I've spoken with a couple of people at SPEC and I get the runaround that linking to fast math libraries isn't representative of what the average joe is capable of. Well, I get the feeling that they think we are a bunch of ignorant asses or something. It's very simple to do this in CVF 6.5 ... if anybody wants to know:

GO INTO SETTINGS ==> SELECT FORTRAN ==> EXTERNAL PROC ==> Select C by reference

External linking:
GO INTO SETTINGS ==> SELECT LINK ==> INPUT ==> Put names of libraries in "Object/library modules"
Put path to libraries in "Additional library path"

Now that's not too difficult, and everybody here is smart enough to figure that out. These math libraries are free. Use them, SPEC. Also.. I keep wondering WHEN WILL YOU TAKE DOWN THAT 1.133 GHz PIII score? Last I heard it wasn't on the market.

Well, I've undoubtedly pissed some people off.. but I wanted to make some points. I think I have, and I apologize for being a bit brutal. I hope we can really address some of this intelligently. And to flamers: don't expect me to respond.

Tim Wilkens
aceshardware.com