To: Petz who wrote (21643) 12/6/2000 8:48:52 PM
From: fyodor_

Petz:

"From what I've read, not true for double precision (64 bits or 80 bits). In fact, the latencies are HIGHER on the P4 than on the P3, but the throughput is exactly the same, i.e., half that of the Athlon core."

Well, the latencies are higher on the P4 than on the P3 (I think; I wasn't able to find P3 data, despite JC's attempts to help me), but the throughput is effectively doubled, since each operation can work on two 64-bit numbers when using SSE2.

Ok... that's what I thought anyway... before I read the numbers I had quoted ;). It seems that Intel has really pulled a number on this one, and only the multiply actually gets double the throughput! You appear to be around 3/4 right ;)

Sources: Intel's P4 Optimization guide and Stuart Oberman's article "Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7(TM) Microprocessor".

(All figures are latency/throughput, in cycles.)

P4: 2x64bit (SSE2)
ADDPD     4/2
MULPD     6/2
DIVPD     62/62
SQRTPD    62/62

P4: 1x64bit (x87)
dp FDIV   34/34
dp FSQRT  38/38
dp FADD   5/1
dp FMUL   7/2

AMD: 1x64bit (x87)
dp DIV    20/17
dp SQRT   27/24

Ouch!

I still say that the P4 is an improvement over its predecessor, though. One indication of these improvements can be seen in Tom's most recent P4 rant: sysdoc.pair.com

Using the Intel-optimized x87 double precision fp iDCT, FlasK manages 14fps with the 1.5GHz P4, compared to 8 for the PIII 1GHz. Clearly there are bandwidth issues as well, but when using SSE2 code, the P4 reaches 19fps. A 1.2GHz Athlon (using an "AMD"-optimized iDCT) does about the same as the 1.5GHz P4. When using 3DNow!, the Athlon barely goes above 15fps, but that's hardly surprising, since 3DNow! doesn't really do anything to improve double precision math (so, in fact, I'm surprised it helped at all).

-fyo
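
As a rough cross-check of the cycle counts quoted above, here is a minimal sketch of the per-result arithmetic. It assumes the second number in each pair is the number of cycles between successive independent instructions of the same kind, and that each SSE2 packed-double instruction produces two results. The names (P4_SSE2, P4_X87, sse2_speedup) are made up for illustration, and the calculation ignores latency, dependency chains and memory bandwidth.

```python
# Cycle counts as quoted in the post: (latency, throughput) per instruction.
# Assumption: "throughput" is the number of cycles between successive
# independent instructions of the same kind.
P4_SSE2 = {"add": (4, 2), "mul": (6, 2), "div": (62, 62), "sqrt": (62, 62)}
P4_X87 = {"add": (5, 1), "mul": (7, 2), "div": (34, 34), "sqrt": (38, 38)}

SSE2_LANES = 2  # each packed-double instruction handles two 64-bit values


def cycles_per_result(throughput_cycles, lanes=1):
    """Cycles spent per 64-bit result at peak issue rate."""
    return throughput_cycles / lanes


def sse2_speedup(op):
    """Peak-throughput ratio of SSE2 over x87 for one operation on the P4."""
    x87 = cycles_per_result(P4_X87[op][1])
    sse2 = cycles_per_result(P4_SSE2[op][1], SSE2_LANES)
    return x87 / sse2


# Hypothetical summary table: operation name -> SSE2 vs. x87 speedup.
SPEEDUPS = {op: sse2_speedup(op) for op in P4_SSE2}

if __name__ == "__main__":
    for op, ratio in SPEEDUPS.items():
        print(f"{op:5s} {ratio:.2f}x")
```

Under those assumptions, the multiply comes out at 2x and the add at 1x, with divide and square root only slightly ahead of their x87 counterparts, which matches the point that only the multiply really gets double the throughput.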