Ali,
Some more Willy stuff. This is from jc-s thread:
Willy : Instruction latencies.. FPU.. and more thoughts
Posted by Remnant on Tuesday, 15 February 2000, at 8:25 p.m.
(here: developer.intel.com)
In the code optimization section, one thing stood out: INSTRUCTION LATENCIES! As you can imagine with such a long pipeline, these are much increased. Check out the examples they gave:
Shift instructions were 1 cycle on the P6 core. On Willamette, they are 2-4 cycles.
Integer and floating-point multiply: 4 cycles on the P6 family; on Willamette, "as many as 10" cycles.
The FXCH instruction, used to optimize P6 floating-point code, is no longer a nearly free instruction. It now has penalties involved, and "should be avoided in Willamette family processors".
Latencies always go up with a longer pipeline, but these are significant increases. The real kicker is FXCH, which is currently used in optimized FPU code to get the highest speed on P2/P3 CPUs. If this instruction has penalties on Willamette, that is bad news for all existing code.
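To put rough numbers on those latencies, here is a back-of-the-envelope sketch in Python. It is not from the Intel document: the function names are made up for illustration, the model assumes a fully pipelined unit that can start one operation per cycle, and the Willamette figures are the upper bounds quoted above.

```python
import math

# Latencies in cycles as quoted in the post.
# Willamette values are the quoted upper bounds ("2-4", "as many as 10").
LATENCY = {
    "shift":    {"P6": 1, "Willamette": 4},
    "multiply": {"P6": 4, "Willamette": 10},
}

def dependent_chain_cycles(n_ops, latency):
    """Rough cycle count for n_ops operations where each one needs the
    previous result, so nothing can overlap (simplified model)."""
    return n_ops * latency

def chains_needed(latency, issue_interval=1):
    """Independent operations that must be in flight to keep the unit busy,
    assuming it can start a new op every issue_interval cycles."""
    return math.ceil(latency / issue_interval)

for op, cores in LATENCY.items():
    for core, cycles in cores.items():
        print(op, core,
              "chain of 100:", dependent_chain_cycles(100, cycles), "cycles,",
              "independent ops needed:", chains_needed(cycles))
```

The point of the second function: the longer the latency, the more independent work the code (or the out-of-order core) has to find to hide it. Tight dependent chains are where the higher latencies hurt most.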
Also, in the whole datasheet I saw no mention of any improvements made to the P3 FPU core other than a load/save state operand. It seems Intel is betting the whole farm on the extended SSE instructions. This is both good and bad.
Benefits: potentially faster, and simpler for them to design than a new x87 FPU. With the new compiler out, people WILL start using SSE more.
Cons: code needs to be optimized for SSE to get anything outta it. With double precision, you can only work on two 64-bit floats at once. Since I see no mention of a 2nd SSE pipeline, I'm not sure this will be significantly faster than an advanced x87 FPU (i.e. Athlon).
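The "two 64-bit floats at once" figure falls straight out of the 128-bit SSE register width. A small Python sketch of the arithmetic (the helper names are hypothetical, and it counts packed operations only, ignoring latency, loads, and stores):

```python
XMM_BITS = 128  # width of an SSE (XMM) register

def lanes(element_bits):
    """How many elements of the given size fit in one XMM register."""
    return XMM_BITS // element_bits

def packed_ops(n_elements, element_bits):
    """Packed operations needed to touch n_elements once,
    rounding up for a partial final register."""
    per_op = lanes(element_bits)
    return -(-n_elements // per_op)  # ceiling division

print("single precision lanes:", lanes(32))
print("double precision lanes:", lanes(64))
print("packed ops for 1000 doubles:", packed_ops(1000, 64))
```

So packed double precision halves the operation count at best compared with one-at-a-time x87, which is why a single SSE pipeline may not pull far ahead of a strong x87 unit.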