Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: kapkan4u who wrote (19235) 11/15/2000 3:43:18 PM
From: jcholewa
 
> FXCH is not free on P4.

I think we're on different tracks here. I'm comparing the P4's SSE2 capabilities against the x87 stack in other processors. This stems largely from the earlier discussion of how the P4 attained scores much greater than those of other processors.

> I don't want to argue the advantages of registers over stack. Seems pretty basic to me.

As far as I know, a two-operand flat register file can, in one instruction, take any two registers, perform an operation on their values, and (in the case of SSE2, I think?) place the result into one of the two given registers.

The x87 stack, or at least one that isn't crippled by a costly FXCH, can perform an operation on the register at the top of the stack and any other register, then leave the result at the top of the stack. An FXCH op issued just before this can exchange any register with the top of the stack, so effectively you are performing an operation on any two registers and placing the result into one of the two given registers.

I know you do not wish to delve into this, but I would still be happy if you could point out if anything I said above is erroneous. Aside from the decode bottleneck, these two implementations do not seem so amazingly different to me, which is why I am interested in hearing an alternative explanation.
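
To make the comparison concrete, here is a toy Python model of the two schemes described above. It is only a sketch: the class and function names are made up for illustration, it models register naming and nothing about timing or decode, and it assumes the stack operation writes its result to the top of the stack.

```python
class FlatRegs:
    """Flat register file with two-operand instructions: dst = dst op src."""

    def __init__(self, values):
        self.r = list(values)

    def op(self, fn, dst, src):
        # One instruction names both registers directly.
        self.r[dst] = fn(self.r[dst], self.r[src])


class StackRegs:
    """x87-style stack model: index 0 is the top of the stack, ST(0)."""

    def __init__(self, values):
        self.st = list(values)

    def fxch(self, i):
        # Exchange ST(0) with ST(i).
        self.st[0], self.st[i] = self.st[i], self.st[0]

    def op(self, fn, i):
        # ST(0) = ST(0) op ST(i); one operand is always the top of the stack.
        self.st[0] = fn(self.st[0], self.st[i])


def add(a, b):
    return a + b


def flat_demo():
    regs = FlatRegs([1.0, 2.0, 3.0, 4.0])
    regs.op(add, 2, 3)  # r2 = r2 + r3, a single instruction
    return regs.r[2]


def stack_demo():
    regs = StackRegs([1.0, 2.0, 3.0, 4.0])
    regs.fxch(2)        # bring the value in slot 2 to the top
    regs.op(add, 3)     # ST(0) = ST(0) + ST(3)
    return regs.st[0]
```

Both demos add the same two values and get the same sum. The stack version needs two instructions to do it, and the result ends up at the top of the stack rather than in the slot the operand started in, so whether the two are really equivalent in practice depends on what the FXCH costs.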
 
 
 
> <PS: Also, you can only do one SSE2 instruction per cycle, so the peak goes down by 50% there.>
> 50% comparing to what? P4 only has one FPU so the peak is one scalar (or half packed) SSE/SSE2
> instruction per clock. The advantage for packed comes with 128bit interface to the d-cache.

Sorry, I was comparing to packed. Also, I was comparing to the competing x87 stack, which is in the end what we're putting the P4's capabilities up against.
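
For what it's worth, the peak arithmetic behind that exchange can be written out as below. This is a back-of-the-envelope sketch only: the issue rates (one scalar, or half a packed, instruction per clock) are taken from the quoted text rather than measured, and the function name is made up for illustration.

```python
REGISTER_BITS = 128  # width of an SSE/SSE2 register


def elements_per_clock(element_bits, instr_per_clock, packed):
    """Peak element operations per clock for a given instruction rate."""
    lanes = REGISTER_BITS // element_bits if packed else 1
    return lanes * instr_per_clock


# Rates assumed from the quoted post: 1 scalar or 0.5 packed instruction per clock.
scalar_double = elements_per_clock(64, 1.0, packed=False)
packed_double = elements_per_clock(64, 0.5, packed=True)
packed_single = elements_per_clock(32, 0.5, packed=True)
```

Under those assumed rates, packed double precision comes out at the same peak as scalar (one element per clock), and only packed single precision doubles it.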

    -JC



To: kapkan4u who wrote (19235) 11/15/2000 11:56:41 PM
From: Petz
 
kap, I wonder why Intel didn't make the SSE2 registers 256 bits wide? That would interface directly to the L2. Then again, maybe the interface to the itsy-bitsy L1 is only 128 bits.

Petz



To: kapkan4u who wrote (19235) 12/11/2000 12:30:03 AM
From: jeff_boyd___
 
"Glad To Hear You Are Still Long AMD Kap"

It has not been easy to stay long over the last few months.

Regards,

Jeff