Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (128527)2/28/2001 2:08:42 PM
From: Rob Young
 
Tench,

You've got the hop counts down, and using the slides you can
determine latency. It's very good, as you can see:
if each hop takes 13 cycles (see slide 13), 14 hops is a very low number on a wall clock (compare a 400 ns latency for a true 64-CPU UMA machine, Sun's UE10000). AND the best thing is that that's the *worst* case. When you do *averages*, the numbers are amazing.
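A quick back-of-the-envelope sketch of that wall-clock math. The 13 cycles/hop figure is from the slides; the router clock frequency is my own assumed round number for illustration, not something stated in the slides:

```python
# Back-of-the-envelope: convert router hop counts to wall-clock latency.
# The 13 cycles/hop figure is from the slides cited in the post.
# ASSUMPTION: a ~1 GHz router clock (round number, so 1 cycle = 1 ns).
CYCLES_PER_HOP = 13
CLOCK_HZ = 1e9

def hop_latency_ns(hops, cycles_per_hop=CYCLES_PER_HOP, clock_hz=CLOCK_HZ):
    """Wall-clock time (ns) for a message traversing `hops` router hops."""
    return hops * cycles_per_hop / clock_hz * 1e9

# 14 hops worst case from the post: 14 * 13 = 182 cycles ~ 182 ns,
# well under the 400 ns quoted for the 64-CPU UMA UE10000.
print(hop_latency_ns(14))
```

Under that assumed clock, even the worst-case 14-hop round trip lands well below the 400 ns UMA figure, which is the point of the comparison.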

Regarding infrastructure and cost, the key to keeping
that down is leaving out external cache (L2 or L3,
depending on whether L2 is on-chip, etc.). You can see
that L2 is shared CPU<->CPU via the router. Secondly, with
on-chip memory controllers and routers, your supporting chipset
costs are much lower (as we noted earlier). Finally,
256 CPUs in a single box will be cheaper
than 64 CPUs in 4 boxes, plus you have a Terabyte of memory
to play with. Note: "cheaper" is a relative term; hope
we can agree on that :-)

"[Won't happen] Period. It's a wild fantasy to think that 256 CPUs can scale UMA-style"

But then you go about the business of determining hops,
and from the technical data we know what that translates
to in wall-clock time... so what was so bad about it again?

Surely Compaq is winning all but one supercomputer bid
for some strange reason?

Rob



To: Tenchusatsu who wrote (128527)2/28/2001 10:39:47 PM
From: Rob Young
 
Tench,

"Period. It's a wild fantasy to think that 256 CPUs can scale UMA-style. For a normal read transaction, the worst-case latency is 14 hops (seven hops for the request, seven hops for the returned data), meaning the average will be seven hops. "

I had done this by hand at one time (did an 8x8), so that
sounded too good. I found a formula for a 2D torus: average
hops (one way) is 1/2 * sqrt(N), or 8 hops for 256 CPUs, meaning
16 round trip. Worst case is sqrt(N), or 16 hops one way,
32 round trip, which would put worst-case latency around 450
ns. But I believe that applies to only 2-3% of memory requests,
and much of the NUMAness will be hidden by OS features,
i.e. local copies of read-only OS pages, etc. I'm going
to try to dredge up some info, as it seems to be an interesting
problem, and I'd like to know worst case, average, etc.
It appears that information can be gleaned for several
CPU counts: 4, 8, 16, etc.

I believe the formula is correct, as it works out for
16 CPUs (worst case is 4 hops anyhow :-)
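The formula can be checked by brute force. This is just a sketch that enumerates all node pairs on a k x k torus (k = sqrt(N)); it uses nothing from the Alpha slides, only the torus topology itself:

```python
# Brute-force average and worst-case one-way hop counts on a k x k 2D torus,
# to check the closed forms: average = sqrt(N)/2, worst case = sqrt(N).
def torus_hops(k):
    """Return (average, worst) one-way hop count over all ordered node
    pairs (self-pairs included) of a k x k torus."""
    def ring_dist(a, b):
        # Shortest distance on a ring of k nodes (wraparound link).
        d = abs(a - b)
        return min(d, k - d)

    nodes = [(x, y) for x in range(k) for y in range(k)]
    dists = [ring_dist(ax, bx) + ring_dist(ay, by)
             for (ax, ay) in nodes for (bx, by) in nodes]
    return sum(dists) / len(dists), max(dists)

print(torus_hops(16))  # 256 CPUs: average 8.0, worst 16 = sqrt(256)
print(torus_hops(4))   # 16 CPUs sanity check: average 2.0, worst 4
```

For 256 CPUs this confirms the numbers in the post exactly: 8 hops average one way (16 round trip) and 16 hops worst case (32 round trip), and the 16-CPU case gives the 4-hop worst case mentioned above.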

Rob