Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: minnow68 who wrote (74081) 3/8/2002 4:14:55 PM
From: Petz
 
Mike, what Ali is referring to with "register renaming technique" is that the microprocessor itself makes optimizations to the code as it is running. For example, suppose we have the code,
temp = (x*c) - (y*s);
y    = (x*c) + (y*s);
x    = temp;
/* this could be part of an FFT; the temporary keeps the second statement from seeing the updated x */
The compiler may invent temporary variables to hold either or both products (call them TEMPIJ and TEMPKL) if it doesn't have enough registers available. But when the code actually runs inside the Hammer, it has been converted to RISC86 micro-ops, and both products get stored in registers that neither the programmer nor the compiler even knows about. The CPU just takes any register that doesn't contain anything at the moment and "renames" it to be whatever register the compiler generated code for.

Although it will still store values into TEMPIJ and TEMPKL, since it can't easily analyze whether some later code is going to try to access those memory locations, it will retrieve TEMPIJ and TEMPKL from the registers they were calculated in: registers that 1) the compiler did not know about, and 2) the programmer did not know about.

The memory writes to TEMPIJ and TEMPKL are probably inconsequential for execution speed: they get written to the L1 cache, making that cacheline "dirty", but probably don't get written back to memory until that cacheline needs to be replaced, most likely not within an inner loop. So if these statements are part of a loop that gets executed 1000 times, TEMPIJ and TEMPKL probably get written *once* and read *never*.

So the 64-bit Hammer should have *lots* of extra registers available for the CPU to use whenever it wants to store temporary values. And since the top 32 bits of the 64-bit registers are not even accessible to 32-bit code, the CPU can already do some of what a good assembly-language programmer or a superb compiler would do if it knew about them.

The only possible "gotcha" with this idea is that, if the x86-64 programming guide says you don't have to save and restore registers when going from 64- to 32- and back to 64-bit mode, then the CPU had better not mess around with state the programmer has assumed will stay unchanged across mode switches.

Having said all this, compression and decoding operations may be more efficient with 64-bit integer arithmetic. IIRC, zipping and unzipping jumped in efficiency going from 16 to 32 bits.

Petz



To: minnow68 who wrote (74081) 3/8/2002 4:57:41 PM
From: Ali Chen
 
minnow, "..on RISC processors there is often an order of magnitude decrease in memory references when compared to x86 processors. This is simply because of fewer registers in x86 so values must be kept in memory."

Ok, now I see the difference: register renaming cannot
eliminate an excess of precoded memory references.
However, those excess memory references are most likely
served from the L1 cache, so the performance penalty is
not too high. (Just grasping at straws ;-)

"As compilers mature, it would not surprise me at all to see a typical x86-64 program actually have _higher_ code density than x86-32."

Here is a caveat: there are no AMD-friendly
compiler development teams left, except maybe the
politically-correct Microsoft; all the others were
bought up by Intel, roots and all.

Of course, Linux could be the biggest beneficiary of
x86-64, if GNU CC (gcc) can catch up quickly on
the AMD x86-64 extensions and gain the overall
efficiency it has been lacking so far.

"BTW, I've read the x86-64 spec, and I have to say that I absolutely love it. x86-32 was a big improvement over x86-16. But that pales in comparison to how much cleaner x86-64 is than x86-32. IMHO, the AMD folks should be extremely proud of themselves."

No objections here ;-) It would be nice, however, to
have really good SPEC2000 scores in addition to it.

Regards,

- Ali



To: minnow68 who wrote (74081) 3/9/2002 8:19:57 AM
From: dale_laroy
 
>There are people I respect who think we should see big performance gains, and other people I respect who think that L1 data access is so fast in x86 processors, that it simply won't matter much. I also wouldn't be surprised to see a "spotty" situation where most programs run about the same speed when recompiled with x86-64, but a few run dramatically faster.<

Those who hold that L1 access is so fast in x86 processors that it simply won't matter much would be right for the P4, but the Athlon, and presumably Hammer, have somewhat slower L1 cache access. Benefits that will be present in all x86-64 implementations are increased code density, as memory operands are replaced by register operands, and decreased power consumption, as the L1 cache is accessed less frequently. Of course, the latter is relative to x86-32 mode on Hammer, not relative to Athlon.