Technology Stocks : CYRIX / NSM


To: Pravin Kamdar who wrote (28353)    7/17/1998 1:03:00 PM
From: Steve Porter
 
Pravin,

How nice to see you remember me <G>

Re: Compiler optimizations:

Yes, it is quite true. Almost all major compiler vendors only support
Intel CPUs when it comes to code optimization. This unfortunate fact
can potentially cause serious performance degradation in software, FPU
code especially. Case in point: if I were performing 4 floating point
operations that weren't interdependent, the compiler would set it up
so that:

FPU1          FPU2
FMUL a,b      FDIV d,c
FADD d,a      FMUL a,c

Now while this is great if you are running on a Pentium or other CPU
with 2 or more FPU pipes, it can really break down when you get to a
CPU with a single FPU pipe.

The biggest problem actually comes from a scheduling conflict or
deadlock that results from integer instructions which are queued (due
to the backlog) and stuck behind floating point instructions which are
waiting their turn. A CPU will only go so far out of order. So in a
typical instruction stream you will get:

This is for informational purposes, to demonstrate the problem. This
example ignores such issues as execution time, latency, cross
dependencies and other internal CPU mumbo-jumbo ;-)


PENTIUM or other multi-piped CPU:

     INT1         INT2         INT3    FPU1           FPU2
1                                      (1)fadd c,d    (2)fmul b,b
2    (4)inc eax   (6)pop dx            (3)fsub x,y    (5)fsub c,b

Now as you can see, that took 2 "clock" cycles, for lack of a better
word. Now watch what happens when the same code runs on a different
CPU (single FPU unit, for example). Again, I am oversimplifying for
the sake of explanation.

CYRIX or AMD or IDTI, etc. (single floating point pipe):

     INT1         INT2    INT3    FPU1
1                                 (1)fadd c,d
2                                 (2)fmul b,b
3    (4)inc eax                   (3)fsub x,y
4    (6)pop dx                    (5)fsub c,b

So basically 4 clock cycles. The thing I want to point out, though, is
that instruction (4) inc eax didn't get processed until the
"third" cycle, because it was stuck behind too many floating point
ops.

Since the integer code above doesn't directly depend on the floating
point code, a good compiler which recognized that there is only 1 FPU
would reorder the integer instructions to the top of the stream, so
they weren't "stalled" behind the FP instructions.

This is the reason that CYRIX and AMD recommend that you optimize your
code for the 486. The 486 had a single-issue floating point unit,
much like AMD's and Cyrix's.
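
For anyone who wants to play with the tables above, here is a minimal
Python sketch of the same toy model. It is only an illustration under
the same simplifications: strict in-order issue, every instruction
takes one "cycle", and no latencies or dependencies. The simulate()
function, the unit counts and the "int"/"fp" tags are all made up for
this example and do not describe any real CPU's scheduler.

```python
# Toy model only: strict in-order issue, one "cycle" per instruction,
# no latencies or dependencies. Names here are hypothetical.

def simulate(stream, int_units=3, fpu_units=1):
    """Return a list of cycles; each cycle lists the instructions issued."""
    if int_units < 1 or fpu_units < 1:
        raise ValueError("need at least one unit of each kind")
    pending = list(stream)
    cycles = []
    while pending:
        free = {"int": int_units, "fp": fpu_units}
        issued = []
        # Issue in program order; stop at the first instruction
        # that cannot get a free unit of its kind.
        while pending and free[pending[0][0]] > 0:
            kind, text = pending.pop(0)
            free[kind] -= 1
            issued.append(text)
        cycles.append(issued)
    return cycles

# The instruction stream from the tables, in program order (1)..(6).
original = [
    ("fp", "fadd c,d"),
    ("fp", "fmul b,b"),
    ("fp", "fsub x,y"),
    ("int", "inc eax"),
    ("fp", "fsub c,b"),
    ("int", "pop dx"),
]

# Same stream with the independent integer instructions moved to the top.
reordered = [op for op in original if op[0] == "int"] + \
            [op for op in original if op[0] == "fp"]

if __name__ == "__main__":
    for label, stream, fpus in [
        ("dual FPU, original order", original, 2),
        ("single FPU, original order", original, 1),
        ("single FPU, integer ops first", reordered, 1),
    ]:
        print(label)
        for n, issued in enumerate(simulate(stream, fpu_units=fpus), 1):
            print("  cycle", n, issued)
```

In this model the reordered stream still needs 4 cycles on a single
FPU, but both integer instructions issue in the first cycle instead of
waiting until cycles 3 and 4.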

Hope this makes some sense to everyone ;-) As I say, I have
oversimplified the above examples, as it would take many hundreds of
instructions to achieve the same result in real life (because the CPU
would try to reorder things on the fly, etc.) And I know they ain't
really "clock" cycles, but for lack of a better word ;-)

Steve