Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Petz who wrote (21643) 12/6/2000 3:53:56 AM
From: ptanner
 
From Aces: AMD to Launch Mobile Duron This Month

Just a brief news item from CTech Taiwan...
aceshardware.com

Highlights: all of the top 5 Taiwan notebook manufacturers have finished designs; may be an "into and then release."

-PT

ps: Just installed the Super Orb heatsink and am browsing the web for a "light load" data point. So far I have learned that if you scrape off the crummy thermal pad, you should absolutely replace it with some thermal compound (any compound was better than the stock pad, IIRC).

I originally posted this on the classic thread but meant to put it here. I figure most folks read both anyway, just one a little more closely than the other, depending on the topic(s) of the moment.



To: Petz who wrote (21643) 12/6/2000 1:48:34 PM
From: jcholewa
 
> From what I've read, not true for double precision (64 bits or 80 bits). In fact, the latencies are HIGHER on
> the P4 than on the P3, but the throughput is exactly the same, i.e., half that of the Athlon core.

I am nearly positive that SSE2 is situated as one pipeline with two units (one for mulpd, the other for addpd). Each unit can only be issued an instruction every other cycle, but an instruction can be fed into the pipeline every cycle. Therefore, you can alternate addpd and mulpd every cycle, which means you have a throughput of one instruction per cycle, or two operations per cycle.

In double precision code alternating between fadds and fmuls, the Athlon can do both one add and one mul per cycle.

If the code happened to be all adds or all muls, then the P4 (in SSE2) would do two operations every two cycles while the Athlon would do one operation every cycle.

This applies only to 64-bit double precision. I believe that 80-bit extended precision does not apply to SSE2, so in that case the Pentium 4's peak is half that of the Athlon's, yes.
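
To make the counting above concrete, here is a toy in-order issue model in Python. It is only a sketch of the description in this post, not a measurement of either chip. The parameters are assumptions taken from the wording above: the P4 SSE2 pipe accepts one packed instruction per cycle, each unit (add or mul) accepts one every other cycle, and each instruction carries two doubles; the Athlon x87 side accepts one add and one mul per cycle, one double each. The function name and arguments are made up for illustration.

```python
from itertools import cycle, islice


def steady_state_ops_per_cycle(pattern, issue_width, unit_interval,
                               ops_per_instr, repeats=100):
    """Toy in-order issue model: FP operations per cycle for a repeating pattern.

    pattern       -- repeating instruction mix, e.g. ["add", "mul"]
    issue_width   -- instructions that may issue in the same cycle
    unit_interval -- cycles before the same unit accepts another instruction
    ops_per_instr -- doubles processed per instruction (2 for packed SSE2)
    """
    n = len(pattern) * repeats
    # One extra instruction: its issue cycle marks the end of n full instructions.
    stream = list(islice(cycle(pattern), n + 1))
    next_free = {}   # first cycle at which each unit can accept work again
    now = 0          # current issue cycle
    issued_now = 0   # instructions already issued in the current cycle
    for op in stream:
        if issued_now == issue_width:   # no issue slot left this cycle
            now += 1
            issued_now = 0
        ready = next_free.get(op, 0)
        if ready > now:                 # unit still busy: stall until it frees up
            now = ready
            issued_now = 0
        next_free[op] = now + unit_interval
        issued_now += 1
    # 'now' is the issue cycle of instruction n, i.e. the time taken by the first n.
    return n * ops_per_instr / now


# Assumed parameters, per the description above (not taken from a datasheet).
P4_SSE2 = dict(issue_width=1, unit_interval=2, ops_per_instr=2)
ATHLON_X87 = dict(issue_width=2, unit_interval=1, ops_per_instr=1)

if __name__ == "__main__":
    for name, cpu in (("P4 SSE2", P4_SSE2), ("Athlon x87", ATHLON_X87)):
        mixed = steady_state_ops_per_cycle(["add", "mul"], **cpu)
        adds = steady_state_ops_per_cycle(["add"], **cpu)
        print(f"{name}: alternating add/mul {mixed} ops/cycle, all adds {adds} ops/cycle")
```

Under these assumptions both chips come out at two operations per cycle on the alternating mix and one per cycle on all adds, which is the claim above. The model ignores latency, dependencies, and memory entirely, so it says nothing about real code.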

> The problem with the Athlon's double precision math advantage is that very often the weak link in the chain
> is the L2 cache throughput or the memory throughput.
> Single channel PC2100 can't match dual channel RDRAM with a 400 MHz bus.

That is a valid assessment. More or less.

    -JC



To: Petz who wrote (21643) 12/6/2000 8:48:52 PM
From: fyodor_
 
Petz: From what I've read, not true for double precision (64 bits or 80 bits). In fact, the latencies are HIGHER on the P4 than on the P3, but the throughput is exactly the same, i.e., half that of the Athlon core.

Well, the latencies are higher on the P4 than the P3 (I think; I wasn't able to find P3 data, despite JC's attempts to help me), but the throughput is effectively doubled since each operation can work on two 64-bit numbers when using SSE2.

Ok... that's what I thought anyway... before I read the numbers I had quoted ;). It seems that Intel has really pulled a number on this one and only the multiply actually receives the double throughput! You appear to be around 3/4 right ;)

Sources: Intel's P4 Optimization guide and Stuart Oberman's article "Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor".

P4: 2x64bit (SSE2) [latency/throughput, in cycles]
ADDPD 4/2
MULPD 6/2
DIVPD 62/62
SQRTPD 62/62

P4: 1x64bit (x87)
dp FDIV 34/34
dp FSQRT 38/38
dp FADD 5/1
dp FMUL 7/2

AMD: 1x64bit (x87)
dp DIV 20/17
dp SQRT 27/24

Ouch!
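
For anyone who wants to check the arithmetic, here is a small Python sketch that turns the tables above into cycles per 64-bit result. It assumes each pair is latency/throughput in cycles, with the second number being the gap between successive instructions, and that a packed SSE2 instruction yields two results while an x87 instruction yields one. The dictionary and function names are made up for illustration; the numbers are copied from the tables as posted.

```python
# (latency, issue interval) in cycles, copied from the tables above.
P4_SSE2 = {"add": (4, 2), "mul": (6, 2), "div": (62, 62), "sqrt": (62, 62)}
P4_X87 = {"add": (5, 1), "mul": (7, 2), "div": (34, 34), "sqrt": (38, 38)}
ATHLON_X87 = {"div": (20, 17), "sqrt": (27, 24)}  # only div/sqrt were quoted


def cycles_per_result(table, op, lanes):
    """Average cycles per 64-bit result at peak issue rate.

    lanes is the number of doubles one instruction produces
    (assumed 2 for packed SSE2, 1 for x87).
    """
    _latency, interval = table[op]
    return interval / lanes


if __name__ == "__main__":
    for op in ("add", "mul", "div", "sqrt"):
        sse2 = cycles_per_result(P4_SSE2, op, lanes=2)
        x87 = cycles_per_result(P4_X87, op, lanes=1)
        print(f"P4 {op:4s}: SSE2 {sse2:5.1f}  x87 {x87:5.1f}  gain {x87 / sse2:.2f}x")
    for op in ("div", "sqrt"):
        athlon = cycles_per_result(ATHLON_X87, op, lanes=1)
        print(f"Athlon {op:4s}: x87 {athlon:5.1f}")
```

Read this way, the add costs one cycle per result on the P4 in either mode, while the multiply drops from two cycles per result to one, which is why only the multiply really gets the doubled throughput. These are per-cycle figures, so clock speed differences between the chips are not accounted for.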

I still say that the P4 is an improvement over its predecessor, though. One indication of these improvements can be seen in Tom's most recent P4 rant:

sysdoc.pair.com

Using the Intel optimized x87 double precision fp iDCT, FlasK manages 14fps with the 1.5GHz P4, compared to 8 for the PIII 1GHz. Clearly there are bandwidth issues as well, but when using SSE2 code, the P4 reaches 19fps.

A 1.2GHz Athlon (using an "AMD"-optimized iDCT) does about the same as the 1.5GHz P4. When using 3DNow!, the Athlon barely goes above 15fps, but that's hardly surprising since 3DNow! doesn't really do anything to improve double precision math (so, in fact, I'm surprised it helped at all).
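
A quick Python sketch of the ratios behind the FlasK numbers quoted above, using only the fps and clock figures given in this post. The Athlon is left out because only an approximate "about the same" figure is available for it. The helper names are made up for illustration.

```python
def speedup(new_fps, old_fps):
    """How many times faster the new result is than the old one."""
    return new_fps / old_fps


def fps_per_ghz(fps, clock_ghz):
    """Frames per second normalized by clock speed, a rough per-clock measure."""
    return fps / clock_ghz


# (fps, clock in GHz) as quoted above.
FLASK_RESULTS = {
    "PIII 1GHz, x87 iDCT": (8, 1.0),
    "P4 1.5GHz, x87 iDCT": (14, 1.5),
    "P4 1.5GHz, SSE2 iDCT": (19, 1.5),
}

if __name__ == "__main__":
    for name, (fps, ghz) in FLASK_RESULTS.items():
        print(f"{name}: {fps} fps, {fps_per_ghz(fps, ghz):.2f} fps/GHz")
    print(f"P4 x87 vs PIII x87: {speedup(14, 8):.2f}x")
    print(f"P4 SSE2 vs P4 x87: {speedup(19, 14):.2f}x")
```

The x87 result is 1.75 times the PIII's on a clock that is only 1.5 times faster, so there is some per-clock gain on this particular benchmark even before SSE2 comes into play. One encoder run is a thin basis for anything more general than that.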

-fyo