Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: kapkan4u who wrote (19222) 11/15/2000 2:41:29 PM
From: jcholewa
 
> First packed SSE2 IS 2-way double precision not 4-way.

That was a typo. Honest. I really do know that you can't pack four 64-bit values into one 128-bit register!! :)
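The lane arithmetic behind that correction can be checked with a short sketch. This is plain Python for illustration, not SSE2 code; `REGISTER_BITS` and `lanes` are names made up here, and the `struct` call only demonstrates that two doubles fill 16 bytes.

```python
import struct

REGISTER_BITS = 128  # width of one SSE/SSE2 XMM register


def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits


single_lanes = lanes(32)  # packed single precision (SSE)
double_lanes = lanes(64)  # packed double precision (SSE2)

# Two little-endian doubles occupy exactly 16 bytes, i.e. 128 bits.
packed = struct.pack("<2d", 1.5, -2.25)
```

So packed single precision is 4-way and packed double precision is 2-way, which is the point kapkan4u was making.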
 
 
 
> Second you forget that SSE/SSE2 have both scalar and packed versions of the same instructions.
> Scalar SSE/SSE2 can be used anywhere to replace x87 arithmetic. The advantage is to use a
> flat register file instead of the brain dead x87 stack.

Yeah, but it's a brain dead two-operand flat register model, not three-operand. I am not trying to pass myself off as any kind of expert, but other than constraints at the decode end of the instruction pipe, I don't see much difference between the flexibility of a two-operand flat model and a stack equipped with an effectively free FXCH. I would be grateful if you could go into a little bit of detail about that. :)

    -JC

PS: Also, you can only do one SSE2 instruction per cycle, so the peak goes down by 50% there.
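
The two-operand flat model versus stack-plus-FXCH comparison raised above can be made concrete with a toy sketch. Everything here is hypothetical: the classes `X87Stack` and `TwoOperandRegs` and their methods are invented for illustration, they only loosely mimic the real mnemonics, and the instruction counts say nothing about actual timing on any processor. Both versions compute a*b + c*d with the two multiplies interleaved.

```python
class X87Stack:
    """Toy x87-style register stack; index 0 is the top of stack, ST(0)."""

    def __init__(self):
        self.st = []
        self.count = 0  # instructions issued in this toy model

    def fld(self, value):
        # Push a value onto the stack.
        self.st.insert(0, value)
        self.count += 1

    def fmul_mem(self, value):
        # ST(0) = ST(0) * memory operand.
        self.st[0] *= value
        self.count += 1

    def fxch(self, i):
        # Swap ST(0) with ST(i).
        self.st[0], self.st[i] = self.st[i], self.st[0]
        self.count += 1

    def faddp(self):
        # ST(1) = ST(1) + ST(0), then pop.
        top = self.st.pop(0)
        self.st[0] += top
        self.count += 1


class TwoOperandRegs:
    """Toy flat register file with two-operand (dst = dst op src) instructions."""

    def __init__(self, n=8):
        self.r = [0.0] * n
        self.count = 0

    def load(self, dst, value):
        self.r[dst] = value
        self.count += 1

    def mul_mem(self, dst, value):
        # dst = dst * memory operand.
        self.r[dst] *= value
        self.count += 1

    def add(self, dst, src):
        # dst = dst + src; the old dst value is overwritten.
        self.r[dst] += self.r[src]
        self.count += 1


def stack_version(a, b, c, d):
    m = X87Stack()
    m.fld(a)
    m.fld(c)       # ST(0) = c, ST(1) = a
    m.fmul_mem(d)  # ST(0) = c*d
    m.fxch(1)      # bring a to the top so the second multiply can start
    m.fmul_mem(b)  # ST(0) = a*b
    m.faddp()      # ST(0) = c*d + a*b
    return m.st[0], m.count


def flat_version(a, b, c, d):
    m = TwoOperandRegs()
    m.load(0, a)
    m.load(1, c)
    m.mul_mem(1, d)  # r1 = c*d
    m.mul_mem(0, b)  # r0 = a*b, no exchange needed to reach r0
    m.add(0, 1)      # r0 = a*b + c*d
    return m.r[0], m.count
```

In this toy model the stack version issues one more instruction than the flat version, and that extra instruction is the FXCH. If the exchange costs nothing to execute, the remaining difference is the slot it takes at decode, which matches the caveat in the post.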



To: kapkan4u who wrote (19222) 11/15/2000 10:36:16 PM
From: Petz
 
kap, do you think P4 can do a 2-way packed double precision multiply in one cycle, two cycles or four cycles?

What about add?

I'm talking throughput, not latency.

Also, does SSE2 have a double precision multiply-accumulate instruction which can be coded 2-way?

Petz