Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Petz who wrote (74115) 3/9/2002 8:15:37 AM
From: pgerassi
 
Dear Petz:

No, you did not say the same thing! This is where your error is. The instructions that the CPU is given say: multiply the contents of EAX by EDX and place the result into EAX (IMUL EDX), and then store the result to 0x???????? (TEMPIJ) (MOV EAX TEMPIJ). The second instruction will always set in motion a write to memory, because the CPU can NOT assume the value will be either overwritten or discarded. Later it may retrieve the result from L1, but that will not stop the eventual write to memory.

With more registers, the instructions may be IMUL EDX, MOV EAX GPR8 (I do not remember the actual x86-64 name). Now the CPU realizes that a memory operation is neither required nor requested. Second, it can perform the next operation from registers rather than from memory, and that operation can run in parallel with other register operations. The L1 data ports limit how many memory reads can run in parallel each cycle. Thus you gain by reducing L1 cache bandwidth use, reducing cache thrashing, and allowing more instructions to run concurrently, which raises IPC. We have not reduced the number of instructions, but we have increased performance, and that is what we really want, isn't it?
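To make the comparison concrete, here is a rough Python sketch that counts memory operations for the two versions. It is a toy model, not a simulation of any real CPU: the tuple format, the helper names, the two-operand form of imul, and the reload of TEMPIJ are all assumptions made for illustration. The extra register is written as r8, which is how x86-64 names the first of its added GPRs.

```python
# Toy model: count data-cache accesses for two versions of the same work.
# Operands written as "[NAME]" are memory locations; anything else is a register.
# The tuple layout and helper names are hypothetical, chosen for illustration.

def is_mem(operand):
    """True if the operand names a memory location."""
    return operand.startswith("[")

def count_memory_ops(program):
    """Return (loads, stores) for a list of (opcode, dest, source) tuples."""
    loads = stores = 0
    for _opcode, dest, source in program:
        if is_mem(source):
            loads += 1
        if is_mem(dest):
            stores += 1
    return loads, stores

# Few registers: the temporary is stored to memory and read back later.
spill_version = [
    ("imul", "eax", "edx"),
    ("mov", "[TEMPIJ]", "eax"),   # store: must eventually reach memory
    ("mov", "eax", "[TEMPIJ]"),   # assumed later reload, from L1 at best
]

# More registers: the temporary stays in an extra register.
register_version = [
    ("imul", "eax", "edx"),
    ("mov", "r8", "eax"),
    ("mov", "eax", "r8"),
]

spill_counts = count_memory_ops(spill_version)
register_counts = count_memory_ops(register_version)
print("spill version    (loads, stores):", spill_counts)
print("register version (loads, stores):", register_counts)
```

Both lists have the same number of instructions; only the memory traffic differs, which is the point being argued.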

SuSE has finished gcc for x86-64, or else x86-64 Linux would not be available (it's used to compile the kernel and all of the apps). Is it ultimately optimal? No, but it's getting there, and it will be more optimal once real hardware is present (it probably is now, with Hammer sampling) and 100% functional (only AMD and a few others know if this is true). The compiler's code generator can easily be extended from the RISC machines to handle the additional 8 GPRs and 8 SSE registers.

BTW, in my example the 7 virtual registers are named something like EAX orig, EAX t3, EAX t4, EAX t5, ... with the retire counter at 2. The next cycle increments the retire counter to 3, EAX orig is freed, and EAX t3 becomes EAX orig. This is done via a map of registers at each counter, and the orig set is kept at the retire counter. It's the map that is updated, and it contains the virtual register's preset address, say 0xda. If the set needs to be flushed, the retire counter remains at the current value and the later stages are overwritten over time.
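The map-and-retire-counter idea can be sketched in a few lines of Python. This is only a toy for a single architectural register (EAX), not how any particular CPU implements renaming: the class name, method names, and the register addresses other than 0xda are made up for illustration, and a flush here frees the speculative entries immediately rather than letting them be overwritten over time.

```python
class RenameMap:
    """Toy rename table for one architectural register (hypothetical model)."""

    def __init__(self, committed_phys, free_phys):
        self.retire = 0                    # retire counter
        self.maps = {0: committed_phys}    # counter value -> register address
        self.next_slot = 1                 # next speculative counter value
        self.free = list(free_phys)        # unused register addresses

    def rename(self):
        """Give a new in-flight write to EAX its own register."""
        phys = self.free.pop(0)
        self.maps[self.next_slot] = phys
        self.next_slot += 1
        return phys

    def committed(self):
        """Address currently playing the role of 'EAX orig'."""
        return self.maps[self.retire]

    def retire_one(self):
        """Advance the retire counter; the old 'EAX orig' is freed."""
        if self.retire + 1 >= self.next_slot:
            raise RuntimeError("nothing in flight to retire")
        self.free.append(self.maps.pop(self.retire))
        self.retire += 1

    def flush(self):
        """Drop speculative mappings; the retire counter does not move."""
        for slot in range(self.retire + 1, self.next_slot):
            self.free.append(self.maps.pop(slot))
        self.next_slot = self.retire + 1


rmap = RenameMap(committed_phys=0xDA, free_phys=[0xDB, 0xDC, 0xDD])
rmap.rename()        # first speculative copy of EAX
rmap.rename()        # second speculative copy of EAX
rmap.retire_one()    # first copy becomes 'EAX orig', 0xDA is freed
rmap.flush()         # second copy is discarded, retire counter unchanged
print(hex(rmap.committed()), rmap.retire, [hex(p) for p in rmap.free])
```

Note that no register contents are ever copied; only the map entries and the counter change, which is what makes retiring and flushing cheap.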

This is normal now that OOE is so prevalent. Further reading will net more details, if you want them.

Pete



To: Petz who wrote (74115) 3/9/2002 5:09:47 PM
From: Joe NYC
 
Petz,

Once a memory location is used to store a temporary result, a lot of inefficiency is introduced. On top of the overhead in a single-CPU system, you have to take into account multiprocessor systems, where the memory location may be used by another processor.

Also, think of the advantages of reduced memory accesses in a multiprocessor system. Any time a temp variable is read back (from L1, L2, or main memory), the processor may need to check with all the other processors that they are not using it. With the use of additional registers, none of this needs to take place.

Joe