Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Ali Chen who wrote (74072), 3/8/2002 3:18:47 PM
From: minnow68
 
Ali,

You wrote "Advantage of having more register was already partially taken by register renaming technique, so explicit coding for extended registers will likely not lead to a breakthrough. 64-bit addressing will bloat the code and harm the decode bandwidth."

I write compilers for a living. I've seen actual program profile data showing that on RISC processors there is often an order of magnitude decrease in memory references compared to x86 processors. This is simply because x86 has fewer registers, so values must be kept in memory. It is an article of faith among many CPU designers and in the world of compiler writers that x86 being register starved is by far its biggest problem. x86-64 dramatically addresses this problem. BTW, this also pretty much solves the code density problem. Even though memory references are more expensive, if you have an order of magnitude fewer of them, you still come out ahead. Indeed, AMD has said that even with the early compilers, they've seen the total amount of code generated, compared to x86-32, increase by only about 10%. As compilers mature, it would not surprise me at all to see a typical x86-64 program actually have _higher_ code density than x86-32. And yes, just because of all those wonderful extra registers.

The fact that they are wider won't matter to most programs. The fact that there are so many of them might make a _huge_ difference in many, many programs.
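
To make that concrete, here is a hypothetical little routine (my own example, not from any real profile data). Inside the loop roughly a dozen integer values are live at once: the pointer, the bounds, four accumulators, four coefficients, and two temporaries. x86-32 has eight general-purpose registers, with ESP reserved and EBP often taken by the frame pointer, so a compiler targeting it would likely spill several of these to the stack and pay extra loads and stores on every trip around the loop. With x86-64's sixteen GPRs they can all stay in registers, leaving only the unavoidable read of in[i].

/* Hypothetical example: about a dozen integer values live across the loop.
 * Targeting x86-32 (8 GPRs), a compiler will likely spill some of the
 * accumulators or coefficients to the stack on each iteration; targeting
 * x86-64 (16 GPRs), they can all be kept in registers. */
long filter(const long *in, int n)
{
    long a0 = 0, a1 = 0, a2 = 0, a3 = 0;    /* accumulators */
    long c0 = 3, c1 = 5, c2 = 7, c3 = 11;   /* coefficients */
    long prev = 0;

    for (int i = 0; i < n; i++) {
        long x = in[i];
        a0 += c0 * x;
        a1 += c1 * (x - prev);
        a2 += c2 * x;
        a3 += c3 * prev;
        prev = x;
    }
    return a0 + a1 + a2 + a3;
}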

Of course, I could be wrong.<G> Nothing compares to actually running the experiment. We should have solid numbers within a year of Hammer shipping (I want to give the compiler writers some time to pick the low-hanging fruit). There are people I respect who think we should see big performance gains, and other people I respect who think that L1 data access is so fast in x86 processors that it simply won't matter much. I also wouldn't be surprised to see a "spotty" situation where most programs run about the same speed when recompiled with x86-64, but a few run dramatically faster.

BTW, I've read the x86-64 spec, and I have to say that I absolutely love it. x86-32 was a big improvement over x86-16. But that pales in comparison to how much cleaner x86-64 is than x86-32. IMHO, the AMD folks should be extremely proud of themselves.

Best Regards,

Mike



To: Ali Chen who wrote (74072), 3/8/2002 11:33:46 PM
From: milo_morai
 
Some thoughts on Latency from ACE's article via the RB RMBS board.
aceshardware.com
By: elazardo $$$$$
08 Mar 2002, 08:03 AM EST Msg. 80343 of 80352
(This msg. is a reply to 80341 by q_azure_q.)
milo, I think that the effect is likely even more pronounced than you infer.

Just for anyone who might care:

To really see the effects, calculate net throughput. Let's say the time to refill a cache line is approximately 100 ns (core to north bridge to memory to north bridge back to core). If a cache hit returns data in 2 clock cycles and we have a 96% hit rate, then the average cycles per access is:

( 0.96 X 2 ) + ((100E-9)*2E9 X 0.04) = 9.92 cycles ave.
Useable work = 2E9/9.92 = 2.0E8.

Now, change the numbers to 3GHz:

( 0.96 X 2 ) + ((100E-9)*3E9 X 0.04) = 13.92 cycles ave.
Useable work = 3E9/13.92 = 2.2E8

A 50% increase in processor speed results in only a 10% gain in net throughput.

Improving the cache turn-around to 1 clock cycle makes little difference:

2E9/8.96 = 2.2E8, and
3E9/12.96 = 2.3E8

We only improved by 5%. The main memory latency totally dominates.
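
Just to make the arithmetic reusable, here is a small sketch of the same back-of-the-envelope model in C (the function and variable names are mine; the inputs -- 96% hit rate, 1- or 2-cycle cache, 100 ns refill -- are the ones used above):

/* Back-of-the-envelope model: average cycles per access is
 * hit_rate * cache_cycles plus miss_rate * (memory latency in core clocks);
 * "useable work" is clock / average cycles. */
#include <stdio.h>

static double avg_cycles(double hit_rate, double cache_cycles,
                         double latency_s, double clock_hz)
{
    return hit_rate * cache_cycles
         + (1.0 - hit_rate) * latency_s * clock_hz;
}

int main(void)
{
    const double hit = 0.96, latency = 100e-9;
    const double clocks[] = { 2e9, 3e9 };

    for (int c = 0; c < 2; c++)
        for (int cache = 2; cache >= 1; cache--) {
            double cyc = avg_cycles(hit, cache, latency, clocks[c]);
            printf("%.0f GHz, %d-cycle cache: %5.2f cycles avg, %.2e/s\n",
                   clocks[c] / 1e9, cache, cyc, clocks[c] / cyc);
        }
    return 0;
}

Running it reproduces the 9.92/13.92-cycle averages and the 2.0E8-2.3E8 throughput figures above; swapping in other latencies and clocks gives the rest of the numbers below.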

Now, if you can reduce the memory delay to 60 ns total from the core to returned data, then even operating at lower frequencies the ABSOLUTE throughput improves, assuming equal work for an equal number of unblocked instructions:

( 0.96 X 2 ) + ( 60E-9 * 1.5E9 * 0.04 ) = 5.52
1.5E9 / 5.52 = 2.7E8
vs
( 0.96 X 2 ) + ( 60E-9 * 2.3E9 * 0.04 ) = 7.44
2.3E9 / 7.44 = 3.1E8

Here we see a 1.5 GHz device with a 2-clk cache and 60 ns memory latency outperforming a 3 GHz device with a 1-clk cache and 100 ns latency (2.7E8 vs 2.3E8), even with high cache hit rates. In real life the situation is a little better, but not a lot. When the core can chew up more than 1 instruction per ns, waiting dozens of ns for a cache miss makes higher clock rates almost completely futile.

Getting the cache hit delay down to 1 cycle now improves throughput by about 20%, versus the roughly 5% we saw above:

1.5E9 / 4.56 = 3.3E8
or
2.3E9 / 6.48 = 3.5E8
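
And the headline comparison spelled out as code (same model, my names; nothing new beyond the numbers already used above):

/* Cross-check of the headline claim: a slower core with lower memory
 * latency beats a faster core with higher latency, even with a faster
 * cache.  avg cycles = hit*cache_cycles + miss*latency*clock. */
#include <stdio.h>

int main(void)
{
    /* 1.5 GHz core, 2-cycle cache, 60 ns miss penalty */
    double slow = 1.5e9 / (0.96 * 2 + 0.04 * 60e-9 * 1.5e9);
    /* 3.0 GHz core, 1-cycle cache, 100 ns miss penalty */
    double fast = 3.0e9 / (0.96 * 1 + 0.04 * 100e-9 * 3.0e9);

    printf("1.5 GHz, 60 ns : %.2e accesses/s\n", slow);   /* ~2.7E8 */
    printf("3.0 GHz, 100 ns: %.2e accesses/s\n", fast);   /* ~2.3E8 */
    return 0;
}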

It is the right-hand terms that dominate the performance: absolute latency, clock rate, and cache miss rate. The only way to significantly improve performance is to work the absolute latency and the cache miss rate, since the clock rate appears in both the numerator and the denominator at almost equal weight and so nearly cancels itself out. This is where I think INTC must have lost its mind by going for a combination of high clock rate and a modest cache.
I know they have sophisticated modelling, but something went terribly wrong. It is almost as though they hired Fleischmann and Pons to evaluate the models.
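
One way to see why the clock term nearly cancels: in this model, useable work = clock / (hit*cache_cycles + miss*latency*clock), and as the clock grows the miss term swamps the hit term, so throughput tops out near 1 / (miss*latency) no matter how fast the core or the cache gets. A quick sketch of that ceiling (my names, same assumptions):

/* Throughput ceiling implied by the model: as the clock goes to infinity,
 * useable work tends to 1 / (miss_rate * memory_latency), independent of
 * both clock rate and cache-hit cost. */
#include <stdio.h>

int main(void)
{
    const double miss = 0.04;                    /* 96% hit rate */
    const double latency[] = { 100e-9, 60e-9 };  /* seconds */

    for (int i = 0; i < 2; i++)
        printf("ceiling at %3.0f ns: %.2e accesses/s\n",
               latency[i] * 1e9, 1.0 / (miss * latency[i]));
    return 0;
}

That is why every 100 ns figure above crowds in under 2.5E8 no matter what the clock or cache does, while cutting latency to 60 ns raises the ceiling to about 4.2E8: only latency and miss rate move it.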

Regards,


If you haven't read it yet, this is the previous post: #reply-17169398

Maybe that's why JS is so confident about Hammer's integrated memory controller.