To: Petz who wrote (74115) 3/9/2002 8:15:37 AM From: pgerassi
Dear Petz:
No, you did not say the same thing! This is where your error is. The instructions that the CPU is given say: multiply the contents of EAX by EDX and place the result into EAX (IMUL EDX), and then store the result to 0x???????? (TEMPIJ) (MOV EAX TEMPIJ). The second instruction will always set in motion a write to memory, because the CPU can NOT assume the value will be either overwritten or discarded. Later it may retrieve the result from L1, but that will not stop the eventual write into memory.
With more registers, the instructions may be IMUL EDX, MOV EAX GPR8 (I do not remember the actual x86-64 name). Now the CPU realizes that a memory operation is neither required nor requested. Second, it can perform the next operation from registers rather than memory, and that operation can run in parallel with other register operations. The L1 data ports do not allow memory read operations to proceed in parallel within a cycle. Thus you gain by reducing L1 cache bandwidth use, reducing cache thrashing, and allowing more instructions to run concurrently, thus raising IPC. We have not reduced the number of instructions, but we have increased performance, and that is what we really want. Isn't it?
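To make the comparison concrete, here is a minimal sketch in Python that counts the memory operations implied by the two sequences. The operand names TEMPIJ and GPR8 come from the discussion above; the tuple encoding, the later reload of the value, and the rule "anything that is not a register name is a memory operand" are assumptions made purely for illustration, not a model of any real decoder.

```python
# Toy model: count the loads and stores implied by an instruction sequence.
# Assumption: any operand that is not a known register name is memory.
REGISTERS = {"EAX", "EDX", "GPR8"}  # GPR8 is the placeholder name used above


def memory_ops(program):
    """Return (loads, stores) for a list of (opcode, source, destination)."""
    loads = 0
    stores = 0
    for _opcode, source, destination in program:
        if source not in REGISTERS:
            loads += 1   # reading a memory operand uses an L1 data port
        if destination not in REGISTERS:
            stores += 1  # writing a memory operand eventually reaches memory
    return loads, stores


# Few registers: the product is spilled to TEMPIJ and (hypothetically)
# reloaded later.
spill_version = [
    ("IMUL", "EDX", "EAX"),
    ("MOV", "EAX", "TEMPIJ"),
    ("MOV", "TEMPIJ", "EAX"),
]

# More registers: the product is parked in GPR8 instead.
register_version = [
    ("IMUL", "EDX", "EAX"),
    ("MOV", "EAX", "GPR8"),
    ("MOV", "GPR8", "EAX"),
]

if __name__ == "__main__":
    print(len(spill_version), memory_ops(spill_version))
    print(len(register_version), memory_ops(register_version))
```

Both sequences have the same instruction count; only the number of memory operations differs, which is the point being made above.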
SuSE has finished gcc for x86-64, or else x86-64 Linux would not be available (it's used to compile the kernel and all of the apps). Is it ultimately optimal? No, but it's getting there, and it will be more optimal once real hardware is present (it probably is now, with Hammer sampling) and 100% functional (only AMD and a few others know if this is true). The code generator in the compiler can be extended from RISC machines easily wrt the additional 8 GPRs and 8 SSE registers.
BTW, in my example the 7 virtual registers are named something like EAX orig, EAX t3, EAX t4, EAX t5, ... with the retire counter at 2. The next cycle increments the retire counter to 3; EAX orig is freed and EAX t3 becomes EAX orig. This is done via a map of registers at each counter, and the orig set is the one kept at the retire counter. It's the map that is updated, and it contains the virtual register's preset address, say 0xda. If the set needs to be flushed, the retire counter remains at its current value and the later stages are overwritten over time.
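The bookkeeping described above can be sketched roughly as follows. This is a toy model for a single architectural register (EAX), not a description of any real CPU: the class name RenameMap, its methods, and every slot address other than 0xda are hypothetical, and a flush here frees the speculative slots immediately instead of letting them be overwritten over time.

```python
class RenameMap:
    """Toy rename table for one architectural register (EAX)."""

    def __init__(self, committed_slot, free_slots, retire_counter=2):
        self.retire_counter = retire_counter
        # counter value -> physical slot holding EAX at that point
        self.maps = {retire_counter: committed_slot}
        self.free = list(free_slots)
        self.next_counter = retire_counter + 1

    def rename(self):
        """Allocate a fresh slot for a speculative write to EAX."""
        slot = self.free.pop(0)
        self.maps[self.next_counter] = slot
        self.next_counter += 1
        return slot

    def retire(self):
        """Advance the retire counter; the old 'orig' slot is freed."""
        if self.retire_counter + 1 not in self.maps:
            raise ValueError("nothing to retire")
        old_slot = self.maps.pop(self.retire_counter)
        self.retire_counter += 1
        self.free.append(old_slot)
        return self.maps[self.retire_counter]

    def flush(self):
        """Drop speculative mappings; the retire counter does not move."""
        speculative = sorted(c for c in self.maps if c > self.retire_counter)
        for counter in speculative:
            self.free.append(self.maps.pop(counter))
        self.next_counter = self.retire_counter + 1

    def committed(self):
        """Slot that currently plays the role of 'EAX orig'."""
        return self.maps[self.retire_counter]


if __name__ == "__main__":
    table = RenameMap(0xDA, [0xDB, 0xDC, 0xDD])
    table.rename()   # EAX t3
    table.rename()   # EAX t4
    table.retire()   # EAX t3 becomes EAX orig
    print(table.retire_counter, hex(table.committed()))
```

Only the map changes on retire or flush; the register contents themselves are never copied, which is why the scheme is cheap.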
This is normal now that OOE is very prevalent. Further reading will net more details, if you want them.
Pete