To: kapkan4u who wrote (19235)
From: jcholewa
11/15/2000 3:43:18 PM

> FXCH is not free on P4.

I think we're on different tracks here. I'm comparing the P4's SSE2 capabilities against the x87 stack in other processors. This stems largely from the original discussion of how the P4 attained scores much greater than those of other processors.

> I don't want to argue the advantages of registers over stack. Seems pretty basic to me.

As far as I know, a two-operand flat register file can, in one instruction, take any two registers, perform an operation on their values, and (in the case of SSE2, I think?) place the result into one of the two given registers.

The x87 stack, or at least one that isn't crippled by a costly FXCH, can perform an operation on the register at the top of the stack and any other register, then write the result to the top of the stack. An FXCH issued just before this can exchange any register with the top of the stack, so effectively you are performing an operation on any two registers and placing the result into one of the two given registers.

I know you do not wish to delve into this, but I would still be happy if you could point out anything erroneous in what I said above. Aside from the decode bottleneck, these two implementations do not seem so amazingly different to me, which is why I am interested in hearing an alternative explanation.

> <PS: Also, you can only do one SSE2 instruction per cycle, so the peak goes down by 50% there.>
> 50% comparing to what? P4 only has one FPU so the peak is one scalar (or half packed) SSE/SSE2
> instruction per clock. The advantage for packed comes with 128bit interface to the d-cache.

Sorry, I was comparing to packed. Also, I was comparing to the competing x87 stack, which is in the end what we're putting the P4's capabilities up against.

-JC
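
PS: To make the stack-versus-flat comparison above concrete, here is a minimal Python sketch of the two register models as I understand them. It is a toy model, not a cycle-accurate one: the class and function names (X87Stack, FlatRegs, stack_add_any_two) are made up for illustration, and it simply counts instructions, so it says nothing about what FXCH actually costs on any given processor.

```python
class X87Stack:
    """Toy model of an x87-style register stack: 8 slots addressed relative to the top."""

    def __init__(self, values):
        assert len(values) == 8
        self.slots = list(values)
        self.top = 0  # physical index of ST(0)

    def _phys(self, i):
        # Map the stack-relative name ST(i) to a physical slot.
        return (self.top + i) % 8

    def st(self, i):
        return self.slots[self._phys(i)]

    def fxch(self, i):
        # FXCH ST(i): exchange ST(0) with ST(i).
        a, b = self._phys(0), self._phys(i)
        self.slots[a], self.slots[b] = self.slots[b], self.slots[a]

    def fadd(self, i):
        # FADD ST(0), ST(i): the sum overwrites the top of the stack.
        self.slots[self._phys(0)] += self.st(i)


class FlatRegs:
    """Toy model of a two-operand flat register file."""

    def __init__(self, values):
        self.r = list(values)

    def add(self, dst, src):
        # add dst, src: the sum overwrites the destination operand.
        self.r[dst] += self.r[src]


def stack_add_any_two(stack, i, j):
    """Add the values named ST(i) and ST(j) using FXCH followed by FADD.

    Assumes i and j are distinct. Returns the number of instructions issued.
    """
    assert i != j
    stack.fxch(i)               # old ST(i) is now at the top
    src = i if j == 0 else j    # the old top moved to ST(i) after the exchange
    stack.fadd(src)
    return 2


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

s = X87Stack(values)
stack_ops = stack_add_any_two(s, 3, 5)   # two instructions on the stack model

f = FlatRegs(values)
f.add(3, 5)                              # one instruction on the flat model
```

Both models end up holding the sum of the same two values in one of the operand positions; the stack version just needs the extra FXCH to get there, which is why the cost of FXCH matters to the comparison.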