To: Ali Chen who wrote (74072), 3/8/2002 11:33:46 PM
From: milo_morai
 
Some thoughts on latency from Ace's article, via the RB RMBS board.
aceshardware.com
By: elazardo
08 Mar 2002, 08:03 AM EST Msg. 80343 of 80352
(This msg. is a reply to 80341 by q_azure_q.)
milo, I think that the effect is likely even more pronounced than you infer.

Just for anyone who might care:

To really see the effects, calculate net throughput. Let's say the time to refill a cache line is approximately 100 ns (core to north bridge to memory to north bridge back to core). If a cache hit returns data in 2 clock cycles and we have a 96% hit rate, then at 2 GHz the average cycles per access is:

( 0.96 X 2 ) + ( (100E-9 * 2E9) X 0.04 ) = 9.92 cycles ave.
Useable work = 2E9 / 9.92 = 2.0E8.
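
If you want to play with these numbers, here is the same arithmetic as a small Python sketch; the function and parameter names are mine, not elazardo's.

# Sketch of the throughput model above (names are mine, not from the post).
def net_throughput(clock_hz, hit_rate, hit_cycles, miss_seconds):
    # A miss costs the absolute memory latency, expressed in core clock cycles.
    miss_cycles = miss_seconds * clock_hz
    # Average cycles per access, weighted by the hit rate.
    avg_cycles = hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles
    # Accesses retired per second.
    return clock_hz / avg_cycles

print(net_throughput(2e9, 0.96, 2, 100e-9))  # ~2.0E8, as above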

Now, change the numbers to 3GHz:

( 0.96 X 2 ) + ((100E-9)*3E9 X 0.04) = 13.92 cycles ave.
Useable work = 3E9/13.92 = 2.2E8

A 50% increase in processor speed results in a net throughput gain of only about 7% (10% if you compare the rounded figures).
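
Reusing net_throughput from the sketch above:

base = net_throughput(2e9, 0.96, 2, 100e-9)  # ~2.02E8
fast = net_throughput(3e9, 0.96, 2, 100e-9)  # ~2.16E8
print(fast / base)                           # ~1.07: the 50% clock bump buys ~7%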

Improving the cache turn-around to 1 clock cycle makes little difference:

2E9/8.96 = 2.2E8, and
3E9/12.96 = 2.3E8

We only improved by a few percent. The main memory latency totally dominates.
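
The same check with a 1-cycle cache hit, reusing the sketch:

print(net_throughput(2e9, 0.96, 1, 100e-9))  # ~2.2E8
print(net_throughput(3e9, 0.96, 1, 100e-9))  # ~2.3E8, still latency-bound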

Now, if you can reduce the memory delay to 60 ns total from the core to returned data, then even at lower clock frequencies the ABSOLUTE throughput improves, assuming equal work for an equal number of unblocked instructions:

( 0.96 X 2 ) + ( 60E-9 * 1.5E9 * 0.04 ) = 5.52
1.5E9 / 5.52 = 2.7E8
vs
( 0.96 X 2 ) + ( 60E-9 * 2.3E9 * 0.04 ) = 7.44
2.3E9 / 7.44 = 3.1E8

Here we see a 1.5 GHz device with a 2-clock cache outperforming the 3 GHz device with a 1-clock cache (2.7E8 vs the 2.3E8 from above) because of the latency difference, even with high cache hit rates. In real life the situation is a little better, but not a lot. When the core can chew up more than one instruction per ns, waiting dozens of ns for a cache miss makes higher clock rates almost completely futile.
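
In the sketch, the head-to-head looks like this (the 3 GHz part keeps its 1-clock cache and 100 ns memory from above):

print(net_throughput(1.5e9, 0.96, 2, 60e-9))   # ~2.7E8
print(net_throughput(2.3e9, 0.96, 2, 60e-9))   # ~3.1E8
print(net_throughput(3.0e9, 0.96, 1, 100e-9))  # ~2.3E8: loses to the 1.5 GHz part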

With the lower memory latency, the incremental gain from getting the cache hit delay down to 1 cycle is now about 20%, versus the few percent we saw above:

1.5E9 / 4.56 = 3.3E8
or
2.3E9 / 6.48 = 3.5E8
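
And the last pair, a 1-cycle cache on top of the 60 ns memory, again via the sketch:

print(net_throughput(1.5e9, 0.96, 1, 60e-9))  # ~3.3E8, ~20% over the 2-clock case
print(net_throughput(2.3e9, 0.96, 1, 60e-9))  # ~3.5E8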

It is the right-hand term that dominates performance: absolute latency, clock rate, and cache miss rate. The only way to significantly improve performance is to work the absolute latency and the cache miss rate, as the clock rate appears in both the numerator and denominator at almost equal weight, and so almost cancels itself out. This is where I think INTC must have lost its mind by going for a combination of high clock rate and a modest cache.
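
To make that near-cancellation explicit (my notation, not elazardo's): write hit rate h, hit latency c cycles, memory latency T seconds, and clock f Hz. The model above is then

X(f) = \frac{f}{h c + (1 - h) f T} = \frac{1}{h c / f + (1 - h) T} \longrightarrow \frac{1}{(1 - h) T} \quad (f \to \infty)

With h = 0.96 and T = 100 ns the ceiling is 1 / (0.04 * 100E-9) = 2.5E8, so the 2 GHz part above already sits at about 80% of what an infinitely fast core could deliver.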
I know they have sophisticated modelling, but something went terribly wrong. It is almost as though they hired Fleischmann and Pons to evaluate the models.

Regards,


If you haven't read it yet, this is the previous Post #reply-17169398

Maybe that's why JS is so confident about Hammer's integrated memory controller.