To: Tenchusatsu who wrote (109145), 9/1/2000 12:07:10 AM
From: Rob Young

"anti-Itanium FUD at beginning"

Don't you remember that? Shortly after the MPF presentation, the Intel/HP fellows came back and showed how they could do the 8 Queens problem in fewer instructions, whereupon the Alpha folks showed they could do it in even fewer! Besides, you must appreciate these aren't garage-shop wanna-bes or Internet wanna-bes; these are serious engineers. So when Pete Bannon (who is one of Compaq's senior consultants, an honor actually) trots out the difficulties with Itanium (obvious run-time problems, i.e. Java and function calls, etc.), you can bet he knows what he is talking about, and it doesn't fall under "FUD." He is showing where Itanium is weakest.

<Why 16 as a choice? Most hops are 2. When you add up the latency for those 2 hops you are still doing much better than an L3 hit and of course main memory.>

"You probably meant "L2 hit" instead of "L3 hit." But that's not the point."

Actually, that is what I meant. If a remote L2 that is 2 hops away has *less* latency than a *possible* L3, then you could say the 21364 has a better, more effective 24 MByte L2. Whereas another Alpha or other architecture would have an L1, L2, and L3 that take x cycles to access; that L3 would be 8 MByte, and odds are that with larger-footprint workloads you miss L3 quite often, so it's off to main memory. HOWEVER, 24 MByte is more than enough of a sweet spot (see the earlier referenced paper for some detail) that it is a much larger win. Bag L3 all the way around!

"latencies incurred with heavy P2P traffic, which is highly variable but must still be considered."

No. They use a "smart" router to ensure traffic takes an optimal path; note the 4 channels. System can't be overloaded? Didn't say that. But I did see a foil elsewhere that quoted something like 100 GByte/sec of aggregate main-memory bandwidth.

Regarding the latencies: the 15 ns is "load to use." Maybe I'm misreading that, but my read is that it includes fetching the line from the remote L2 and loading it into the local L1. Besides, if you think about it a bit, they are sandbagging there too. After all, the point-to-point CPU-to-CPU links run at CPU speed B^).

Software optimized to take advantage of it? No, not at all. Remember, the goal here is to run programs unmodified. You misspoke, I believe: it's the OSes that have to be modified. That's a given, and they have been modified and are running on NUMA today.

Best case today, Wildfire local memory access is 330 ns and remote is 960 ns. With the on-chip memory controller, RDRAM local memory access is on the order of 70 ns, I believe. Much better. To hit remote memory over 2 hops you get 70 ns + 15 + 15, maybe some extra factor in there too, so what, 130-150 ns? I think that might be right, but they don't like to talk exact numbers this early.

But where it is much better, and where we are talking past each other in a sense, is the 2-hop remote L2 hit. Traditionally you go over a switch; on the 21364 you don't, and unless I'm mistaken a 2-hop remote L2 hit could be around 45 ns (fudge factor there). You won't get that kind of latency in a traditional architecture. Since OLTP is very latency-sensitive, the 21364 will shine in this space. Power4 has a good thing going there too, but what about 16? 32? 64 CPUs? Can anyone afford 64 Power4 CPUs?

Re-reading: don't get hung up on the UMA-versus-NUMA thing. All Tru64 and VMS software runs unmodified on the current Wildfire "NUMA." Most Wildfires with Tru64 run as a single system image, one big flat memory.
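To make that arithmetic concrete, here is a toy latency model in C. Every constant in it is just a rough figure from this thread (70 ns local memory, ~15 ns per hop "load to use", ~15 ns assumed remote L2 lookup) plus a guessed per-hop router/protocol overhead; none of these are official Compaq numbers, so treat it strictly as a back-of-envelope sketch.

    /* Back-of-envelope latency model for the 21364's 2D mesh, using the
     * rough figures quoted in this thread. All constants are assumptions,
     * not official Compaq numbers. */
    #include <stdio.h>

    #define LOCAL_MEM_NS     70.0  /* on-chip RDRAM controller, local access (quoted estimate) */
    #define HOP_NS           15.0  /* per-hop "load to use" figure quoted above */
    #define ROUTER_FUDGE_NS  20.0  /* guessed per-hop router/protocol overhead */

    /* Estimated latency of a main-memory access that is `hops` router hops away. */
    static double remote_mem_ns(int hops)
    {
        return LOCAL_MEM_NS + hops * (HOP_NS + ROUTER_FUDGE_NS);
    }

    /* Estimated latency of a hit in a remote CPU's L2 cache `hops` away.
     * No DRAM access is involved, only hop cost plus an assumed remote
     * L2 lookup time. */
    static double remote_l2_ns(int hops)
    {
        const double l2_lookup_ns = 15.0;  /* assumed remote L2 access time */
        return l2_lookup_ns + hops * HOP_NS;
    }

    int main(void)
    {
        printf("local memory:        %.0f ns\n", remote_mem_ns(0));  /* 70 ns  */
        printf("2-hop remote memory: %.0f ns\n", remote_mem_ns(2));  /* 140 ns */
        printf("2-hop remote L2 hit: %.0f ns\n", remote_l2_ns(2));   /* 45 ns  */
        return 0;
    }

Run it and you get roughly 70 / 140 / 45 ns, which is where my 130-150 ns remote-memory guess and the ~45 ns remote L2 figure above come from. Compare those against Wildfire's 330 ns local / 960 ns remote and you see why even the "slow" path wins.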
The thing I think you are getting hung up on is that some memory accesses are faster than others. No big deal (but a VERY big deal at the OS level). Look at the "slow" 21364 memory access: it is still much better than the "fastest" Wildfire (aka GS320) memory access.

Rob