BMG's take on Latency
Author: BlackMagicGuy | Subject: Re: Latency comments on ACE's article | Date: 3/9/02 9:28 AM | Recommendations: 0
Alan,
See boards.fool.com.
I took a shot at answering this very question back in November. Read that note before going through this one, since this one builds on it.
I assumed a 0.1% cache miss rate, and 100 ns vs 40 ns cycle times.
I pointed out a number of problems there. There are many variables that depend on the code: cache miss rates, page hits vs misses, mispredicted branches, prefetching, refresh cycles, and so forth.
All of these variables impact the total cycle time, and the difference between integrated-NB and external-NB parts varies with them. As Eachus said, with an integrated controller you have control, down to the last clock before the fetch, over whether to execute or skip a memory request. Consider it similar to Just-In-Time manufacturing techniques.
Now, there are a few fixed-time differences. For a 266 MHz bus (3.76 ns clock), it takes 4 clocks to generate the request, or about 15 ns. Unfortunately, even this is really variable: if there are other requests pending on the 'slow' bus, the request has to wait behind each of them, or at least await the one currently executing. In my last note I made this 20 ns; maybe I was a little conservative there. With an integrated controller the queuing already takes place in the CPU anyway, so there is no reason to repeat the queuing decision, which I estimated was probably 25 ns in the NB chip.
Now, once the data is retrieved from the DIMM, it must go back across the CPU bus. Again, I chose 20 ns, but this is variable just as the first leg was above. Also, note that since Athlon has an OoO data bus, the NB must first tell the CPU that the following data is for cycle XX requested a while ago. I gave this zero time, but it could be another 10+ ns delay; again, lots of variables here.
Now, the actual cycle time to memory, assuming it's a page hit, can be darn fast. This used to be a 100+ ns kind of number, but today it is in the 10's of ns, and getting quicker. My 40 ns number assumes mostly page hits, blah blah blah. It is reasonable.
So, you get a 60% net improvement in latency. That is substantial.
Now, as I said, some of these numbers have huge variation. Code like CR's probably has a cache miss rate of nearly 10%, depending on how you view it (i.e., with each line you read, you're sure to have a dirty cache line that needs to get written out to memory, so it can be viewed as a double miss, whereas most normal code has a great many cache lines that simply get thrown away because they haven't been modified). CR's code is so miss-heavy that latency AND bandwidth are equally important. Heck, for code like that, how you organize your arrays and the arrangement of memory sticks can be just as meaningful. For instance, CR should consider using more, smaller memory sticks, since they will have more RAS pages that can be open at once, and arrange the array sizes so that when he works with smaller sub-arrays he can keep as much of them on the same page as possible. At this point you get into a fine science, but it is necessary for optimizing code that can take not only hours but even days to run.
My assumption was that code would have a 99.9% hit rate, and I guessed at an 18% performance advantage. Now, if I increase that miss rate to 1% (still assuming a 2 GHz CPU with an IPC of 2), I get (990 x 0.25 ns) + (10 x 40 ns) = 650 ns per 1,000 instructions for Hammer; just replace the 40 ns with 105 ns and it skyrockets to 1300 ns, or almost exactly double the time. Note, 1% miss rates aren't uncommon at all. This advantage continues to increase until you hit a limitation due to memory bandwidth. Of course, Hammer has an answer to that problem as well: just go with Sledge and multiple CPUs, and memory bandwidth gets into supercomputer levels. Intel's answer to this: they'll have to get back to us with an answer... ;D
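If you want to fiddle with these numbers yourself, here's a quick C sketch of the same back-of-the-envelope model (the 2 GHz / 2-IPC core and the 40 ns vs 105 ns latencies are just the assumptions above, nothing measured):

```c
#include <stdio.h>

/* Time (ns) to retire 1,000 instructions on a 2 GHz, 2-IPC core when a
 * given fraction of them miss all caches and wait on main memory.      */
static double time_per_1000(double miss_rate, double mem_latency_ns)
{
    const double instr_ns = 0.25;          /* 1 / (2 GHz x 2 IPC) */
    double misses = 1000.0 * miss_rate;
    double hits   = 1000.0 - misses;
    return hits * instr_ns + misses * mem_latency_ns;
}

int main(void)
{
    /* 1% miss rate: on-die controller (~40 ns) vs external NB (~105 ns) */
    printf("1.0%% miss: %6.1f ns vs %6.1f ns\n",
           time_per_1000(0.010, 40.0), time_per_1000(0.010, 105.0));
    /* 0.1% miss rate, as in the November note */
    printf("0.1%% miss: %6.1f ns vs %6.1f ns\n",
           time_per_1000(0.001, 40.0), time_per_1000(0.001, 105.0));
    return 0;
}
```

That prints roughly 650 vs 1300 ns at 1% and roughly 290 vs 355 ns at 0.1%, the same figures quoted above.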
This is a very interesting topic. But, I've written enough. Comments & thoughts?
Black Magic
-------------------------------------------------------------
This is his previous post
Author: BlackMagicGuy | Subject: Memory interface benefits | Date: 11/29/01 11:18 PM | Recommendations: 18
All,
This is really a response to Koralis, who posted boards.fool.com, but I wanted to start another thread due to my focus.
I've been wanting to discuss this for some time, but have been waiting for the info to be more public. Hammers will all have built-in memory controllers. Yes, this does take die space & such, but I really want to talk about performance benefits.
First, latency. Today, with any CPU, the CPU needs to decide that it wants to read a location (verify it's not in L1 or L2, ...). It tells the NB over a relatively slow bus (133-400 MHz for CPUs today). The NB needs to put this into a queue behind any previous memory accesses (plus possible delays from integrated-video memory accesses). After it gets to the head of the queue, the NB reads the memory from the RAM, and finally posts the results back to the CPU. Let me put this in time-line fashion:
(1) CPU request on bus -> (2) put in queue -> (3) RAS-CAS cycles to memory -> (4) read/write to memory -> (5) respond over CPU bus.
Compare this to what Hammer can do:
RAS-CAS cycles to memory -> read/write data to memory
For ages there was the debate of DDR vs Rambus, and the consensus has been that the lower latency of DDR wins in most cases. For the two cases above, let me break them down into delays. Today's method (using rough estimates): (1) 20 ns -> (2) 25 ns -> (3 & 4) same for both, maybe 40 ns -> (5) 20 ns.
Keep in mind that all of these numbers can vary; depending on conflicting cycles, RAS hits, etc., it can be much faster or slower. I just tried to pick middle numbers. So, the latency drops from just over 100 ns to about 40, or a 60% improvement.
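Just to sanity-check those middle-of-the-road guesses, here is the arithmetic in C form (nothing in it but the rough per-stage estimates above):

```c
#include <stdio.h>

int main(void)
{
    /* Rough per-stage estimates from the breakdown above, in ns */
    int nb_path = 20 + 25 + 40 + 20;  /* request + queue + DRAM + response */
    int hammer  = 40;                 /* on-die controller: basically just the DRAM time */
    printf("NB path %d ns, Hammer %d ns, improvement %.0f%%\n",
           nb_path, hammer, 100.0 * (nb_path - hammer) / nb_path);
    return 0;
}
```

That works out to about 62%, which is the "60%" I'm quoting.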
When the CPU is stalled waiting for instructions or data, this number can be a very big factor. Even if only 0.1% of instructions have this wait, let me add it up. Assume you have a 2 GHz CPU with 2 IPC. An instruction will take 0.25 ns to execute, so 999 instructions will take 249.75 ns. If that last instruction needs to be fetched through a NB, the total is about 355 ns. If it is a Hammer, that drops to about 290 ns. So, that is an 18% improvement simply by bypassing the NB, with everything else identical. And the higher the MHz and IPC, the bigger the difference. Good idea?
I'm not sure where I want to take this. For now, let me just put this out there. I'm not familiar with Itanium's architecture, but I don't think it does this. As I see it, this is just one more step in the list of advantages that AMD has over Intel.
Some might argue that this ties a CPU to a single memory type, or it reduces the ability of MBs to offer unique features. Hogwash. Maybe there are some other arguments that do hold water, I'd like to hear them.
Please, toss in your $0.02, let me know what you think... And again, thanks koralis for bringing it up!
Black Magic
Also found this post
Author: BlackMagicGuy | Subject: Re: Memory interface benefits | Date: 12/1/01 3:02 PM | Recommendations: 0
Chuck,
First, for all you non-techies out there, you might want to skip this one. Way too many numbers and calculations...
Chuck, look at your array. How many BYTES of data are associated with each point? Keep in mind that, for example with Athlons, there is a 128 kB L2 data cache and a 64 kB L1 data cache. Allow a bit for storing any overhead: interrupts, etc. If your array has 3 variables of 80 bits (10 bytes) each at each location, plus maybe 20 bytes for any physical properties stored there (you will obviously know this), then each point takes 3 x 10 + 20, or 50 bytes. To fit that into 150 kB of cache means you can hold at most roughly 3,000 points. That's only about a 14 x 14 x 14 array.
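A quick sanity check on that arithmetic (the 10-byte variables, ~20 bytes of properties, and ~150 kB of usable cache are only the illustrative assumptions above; plug in your real sizes):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative assumptions only: sizes depend on your real data layout */
    int var_bytes   = 10;                  /* one 80-bit extended-precision value       */
    int point_bytes = 3 * var_bytes + 20;  /* three variables + ~20 bytes of properties */
    int cache_bytes = 150 * 1024;          /* ~150 kB of usable L1 + L2 data cache      */
    printf("%d bytes/point -> about %d points fit in cache\n",
           point_bytes, cache_bytes / point_bytes);
    return 0;
}
```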
However, this may be the key. You really want to have the area you are working with all in L1 cache. Assume your total array is A x B x C. Take an M x M x C piece of that, and run your calculation across that area as a diagonal cut. That will ensure maximum availability of the needed data in L1 cache.
Does the diagonal cut make sense? Each point only affects the points immediately beside it in X, Y, & Z. Elements diagonal from them don't matter, and that's why the diagonal cut works: it maximizes the number of relationships between elements already in cache. For instance, the first four steps would be:
0: 0,0,0
1: 1,0,0 ; 0,1,0 ; 0,0,1
2: 2,0,0 ; 0,2,0 ; 0,0,2 ; 1,1,0 ; 1,0,1 ; 0,1,1
3: 1,1,1 ; 3,0,0 ; 0,3,0 ; 0,0,3 ; 2,1,0 ; 2,0,1 ; 1,2,0 ; 0,2,1 ; 1,0,2 ; 0,1,2
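Here is a minimal C sketch of that visiting order; it simply enumerates every point whose coordinate sum is n before moving on to n+1 (the order within each step differs from my listing above, but only the grouping matters, and I've left out bounds checking and the actual calculation on each point):

```c
#include <stdio.h>

/* Print grid points in "diagonal cut" order: every point whose coordinates
 * sum to n is visited before any point whose coordinates sum to n + 1.    */
static void diagonal_order(int max_sum)
{
    for (int n = 0; n <= max_sum; n++) {
        printf("%d:", n);
        for (int x = 0; x <= n; x++)
            for (int y = 0; y <= n - x; y++)
                printf(" %d,%d,%d ;", x, y, n - x - y);
        printf("\n");
    }
}

int main(void)
{
    diagonal_order(3);   /* same four steps as listed above: 1, 3, 6, 10 points */
    return 0;
}
```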
Now, if I want this to fit into maybe 40 kB of L1 data cache (a conservative number compensating for constants, cache-line fullness, etc.; you can fiddle with this to get the optimum amount), and there are 50 bytes per array element, that gives you 800 entries that you can hold. For an M x M area, the number of elements will be the sum of 1 to M, inclusive. So, for M = 1, 2, 3, 4, 5, etc., your total number of elements will be 1, 3, 6, 10, 15, etc. I used Excel with its simple math functions to calculate that, for my assumptions above, 39 x 39 is the maximum array area you could use while keeping all your data in the 40 kB section of L1 cache.
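You don't even need Excel for that; a few lines of C give the same answer, assuming the 40 kB budget and 50 bytes per element from above:

```c
#include <stdio.h>

int main(void)
{
    int budget_bytes = 40 * 1024;  /* conservative slice of L1 data cache */
    int elem_bytes   = 50;         /* per-element size assumed above      */
    int m = 0;
    /* A diagonal sweep of an M x M area touches 1 + 2 + ... + M = M*(M+1)/2 elements */
    while ((m + 1) * (m + 2) / 2 * elem_bytes <= budget_bytes)
        m++;
    printf("largest M: %d (%d elements, %d bytes)\n",
           m, m * (m + 1) / 2, m * (m + 1) / 2 * elem_bytes);
    return 0;
}
```

That prints M = 39, matching the spreadsheet result.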
Also, each time you traverse the diagonal cut, first go from bottom to top, then from top to bottom, and then start at the beginning again. This again maximizes the number of related data points you still have in L1. Or, you can cut the array in half to keep two sweeps across it in memory; IOW, a 27 x 27 area will hold just under 400 elements.
If you can get down to the compiler level, you can also optimize your instruction order so that you make optimal use of all 3 paths within the FPU. I believe one does complicated instructions like multiplication, one does simple addition, and one does loads and stores. Keep in mind approximately how many cycles each of them needs, and you can allocate with that in mind as well. If you are writing in C, it isn't that difficult to write a small assembly-level piece of code that gets called by your higher-level C code. This assembly code could be written to do the base calculations for each array element.
Optimizations on top of optimizations...
This all takes quite a bit of effort, but for calculations which normally take days, these can make some real improvements. I hope all of this helps. If you have questions, let me know.
Black Magic