To: Pravin Kamdar who wrote (28353) 7/17/1998 1:03:00 PM From: Steve Porter
Pravin,

How nice to see you remember me <G>

Re: Compiler optimizations:

Yes, it is quite true. Almost all major compiler vendors only support Intel CPUs when it comes to code optimization. This unfortunate fact can potentially cause serious performance degradation in software, FPU code especially.

Case in point. If I were performing 4 floating point operations that weren't interdependent, the compiler would set it up so that:

FPU1          FPU2
FMUL a,b      FDIV d,c
FADD d,a      FMUL a,c

Now while this is great if you are running on a Pentium or other CPU with 2 or more FPU pipes, it really can break when you get to a single pipe. The biggest problem actually comes from a scheduling conflict or deadlock that is a result of integer instructions which are queued (due to the backlog) and stuck behind floating point instructions which are waiting their turn. A CPU will only go so far out of order. So in a typical instruction stream you will get the schedules shown below.

This is for information purposes, to demonstrate the problem. This example ignores such issues as execution time, latency, cross dependencies and other internal CPU mumbo-jumbo ;-)

PENTIUM or other multi-piped CPU:

     INT1         INT2        INT3    FPU1          FPU2
1                                     (1)fadd c,d   (2)fmul b,b
2    (4)inc eax   (6)pop dx           (3)fsub x,y   (5)fsub c,b

Now as you can see, that took 2 "clock" cycles, for lack of a better word. Now watch what happens when the same code runs on a different CPU (one with a single FPU unit, for example). Again, I am oversimplifying for the sake of explanation.

CYRIX or AMD or IDTI, etc. (single floating point pipe):

     INT1         INT2        INT3    FPU1
1                                     (1)fadd c,d
2                                     (2)fmul b,b
3    (4)inc eax                       (3)fsub x,y
4    (6)pop dx                        (5)fsub c,b

So basically 4 clock cycles. The thing I want to point out, though, is that instruction (4) inc eax didn't get processed until the third cycle, because it was stuck behind too many floating point ops.
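For anyone who wants to play with the two schedules above, here is a minimal Python sketch of the same toy model. It rests on assumptions that are not in the post: issue is strictly in program order, every instruction takes one "cycle", and issue stops for the cycle as soon as the next instruction cannot get a free unit. The names `STREAM` and `schedule` are made up for illustration, and this is not how any real CPU schedules.

```python
# Toy issue model (an assumption, not a real CPU): strictly in-order issue,
# one "cycle" per instruction, and the first instruction that finds no free
# unit blocks everything behind it until the next cycle.

STREAM = [
    ("fadd c,d", "fpu"),  # (1)
    ("fmul b,b", "fpu"),  # (2)
    ("fsub x,y", "fpu"),  # (3)
    ("inc eax", "int"),   # (4)
    ("fsub c,b", "fpu"),  # (5), mnemonic read as fsub here
    ("pop dx", "int"),    # (6)
]


def schedule(stream, int_pipes, fpu_pipes):
    """Return a list of cycles; each cycle lists the instructions issued."""
    if int_pipes < 1 or fpu_pipes < 1:
        raise ValueError("need at least one pipe of each kind")
    cycles = []
    pending = list(stream)
    while pending:
        free = {"int": int_pipes, "fpu": fpu_pipes}
        issued = []
        # Issue in program order until the head instruction has no free unit.
        while pending and free[pending[0][1]] > 0:
            name, kind = pending.pop(0)
            free[kind] -= 1
            issued.append(name)
        cycles.append(issued)
    return cycles


dual = schedule(STREAM, int_pipes=3, fpu_pipes=2)
single = schedule(STREAM, int_pipes=3, fpu_pipes=1)

for label, result in (("two FPU pipes", dual), ("one FPU pipe", single)):
    print(label, "->", len(result), "cycles")
    for number, issued in enumerate(result, start=1):
        print(" ", number, ", ".join(issued))
```

Under these assumptions the two-pipe case finishes in 2 cycles and the single-pipe case in 4, with inc eax not issuing until cycle 3.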
Since the integer code above doesn't directly depend on the floating point code, a good compiler that recognized there is only 1 FPU pipe would reorder the integer instructions to the top of the stream, so they weren't "stalled" behind the FP instructions. This is the reason that CYRIX and AMD recommend that you optimize your code for the 486. The 486 had a single-issue floating point unit, much like AMD's and Cyrix's.

Hope this makes some sense to everyone ;-) As I say, I have oversimplified the above examples, as it would take many hundreds of instructions to achieve the same result in real life (because the CPU would try to reorder things on the fly, etc.). And I know it ain't "clock" cycles, but for lack of a better word ;-)

Steve
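As a rough illustration of the reordering idea, the Python sketch below hoists the integer instructions to the top of the stream and reports the cycle in which each instruction issues on a single-FPU-pipe toy machine. The assumptions are loud ones: strictly in-order issue, one "cycle" per instruction, and integer ops that really are independent of the FP ops (which is what makes the hoist legal). `hoist_integer_ops` and `issue_cycles` are hypothetical helper names, not anything a real compiler exposes.

```python
# Hypothetical sketch: move integer ops ahead of FP ops, then compare issue
# cycles on a toy machine with one FPU pipe. In-order issue and one cycle
# per instruction are simplifying assumptions, not real CPU behaviour.

STREAM = [
    ("fadd c,d", "fpu"),  # (1)
    ("fmul b,b", "fpu"),  # (2)
    ("fsub x,y", "fpu"),  # (3)
    ("inc eax", "int"),   # (4)
    ("fsub c,b", "fpu"),  # (5), mnemonic read as fsub here
    ("pop dx", "int"),    # (6)
]


def hoist_integer_ops(stream):
    """Stable partition: integer ops first, relative order otherwise kept."""
    ints = [ins for ins in stream if ins[1] == "int"]
    fps = [ins for ins in stream if ins[1] == "fpu"]
    return ints + fps


def issue_cycles(stream, int_pipes=3, fpu_pipes=1):
    """Map each instruction to the cycle in which it issues."""
    if int_pipes < 1 or fpu_pipes < 1:
        raise ValueError("need at least one pipe of each kind")
    result = {}
    cycle = 1
    free = {"int": int_pipes, "fpu": fpu_pipes}
    for name, kind in stream:
        if free[kind] == 0:
            # Stall: this instruction and everything behind it waits a cycle.
            cycle += 1
            free = {"int": int_pipes, "fpu": fpu_pipes}
        free[kind] -= 1
        result[name] = cycle
    return result


before = issue_cycles(STREAM)
after = issue_cycles(hoist_integer_ops(STREAM))

for name in ("inc eax", "pop dx"):
    print(name, "issues in cycle", before[name], "->", after[name])
print("total cycles:", max(before.values()), "->", max(after.values()))
```

In this toy model the total stays at 4 cycles, because the single FPU pipe is still the bottleneck. What changes is that inc eax and pop dx issue in cycle 1 instead of cycles 3 and 4, so nothing that depends on them has to wait behind the FP backlog.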