Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (128527)2/28/2001 2:08:42 PM
From: Rob Young
 
Tench,

You've got the hop counts down, and using the slides you can
determine latency. It's very good, as you can see:
if each hop takes 13 cycles (see slide 13), 14 hops is a very low number on a wall clock (compare a 400 ns latency for a true 64-CPU UMA machine, Sun's UE10000). AND the best thing is that that's the *worst* case. When you do *averages*, the numbers are amazing.
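A quick back-of-the-envelope sketch of that wall-clock math. The 13 cycles/hop figure is from the slides; the router clock frequency is my own assumed round number for illustration, not something stated in the slides:

```python
# Back-of-the-envelope: convert router hop counts to wall-clock latency.
# The 13 cycles/hop figure is from the slides cited in the post.
# ASSUMPTION: a ~1 GHz router clock (round number, so 1 cycle = 1 ns).
CYCLES_PER_HOP = 13
CLOCK_HZ = 1e9

def hop_latency_ns(hops, cycles_per_hop=CYCLES_PER_HOP, clock_hz=CLOCK_HZ):
    """Wall-clock time (ns) for a message traversing `hops` router hops."""
    return hops * cycles_per_hop / clock_hz * 1e9

# 14 hops worst case from the post: 14 * 13 = 182 cycles ~ 182 ns,
# well under the 400 ns quoted for the 64-CPU UMA UE10000.
print(hop_latency_ns(14))
```

Under that assumed clock, even the worst-case 14-hop round trip lands well below the 400 ns UMA figure, which is the point of the comparison.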

Regarding infrastructure and cost, the key to keeping
that down is leaving out external cache (L2 or L3,
depending on whether L2 is on-chip, etc.). You can see
that L2 is shared CPU<->CPU via the router. Secondly, with
on-chip memory controllers and routers, your supporting chipset
costs are much lower (as we noted earlier). Finally,
256 CPUs in a single box will be cheaper
than 64 CPUs in 4 boxes, plus you have a Terabyte of memory
to play with. Note: "cheaper" is a relative term; hope
we can agree on that :-)

"[Won't happen] Period. It's a wild fantasy to think that 256 CPUs can scale UMA-style"

But then you go about the business of determining hops,
and from the technical data we know what that translates
to in wall-clock time... so what was so bad about it again?

Surely Compaq is winning all but one supercomputer bid
for some strange reason?

Rob



To: Tenchusatsu who wrote (128527)2/28/2001 10:39:47 PM
From: Rob Young
 
Tench,

"Period. It's a wild fantasy to think that 256 CPUs can scale UMA-style. For a normal read transaction, the worst-case latency is 14 hops (seven hops for the request, seven hops for the returned data), meaning the average will be seven hops. "

I had done this by hand at one time (did an 8x8), so that
sounded too good. I found a formula for a 2D torus: average
hops (one way) is 1/2 * sqrt(N), or 8 hops for 256 CPUs, meaning
16 round trip. Worst case is sqrt(N), or 16 hops one way,
32 round trip, which would put worst-case latency around 450
ns. But I believe that applies to only 2-3% of memory requests,
and much of the NUMAness will be hidden by OS features,
i.e. local copies of read-only OS pages, etc. I'm going
to try to dredge up some info, as it seems to be an interesting
problem, and I'd like to know worst case, average, etc.
It appears that information can be gleaned for several
CPU counts: 4, 8, 16, etc.

I believe the formula is correct, as it works out for
16 CPUs (worst case is 4 hops anyhow :-)
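The formula can be checked by brute force. This is just a sketch that enumerates all node pairs on a k x k torus (k = sqrt(N)); it uses nothing from the Alpha slides, only the torus topology itself:

```python
# Brute-force average and worst-case one-way hop counts on a k x k 2D torus,
# to check the closed forms: average = sqrt(N)/2, worst case = sqrt(N).
def torus_hops(k):
    """Return (average, worst) one-way hop count over all ordered node
    pairs (self-pairs included) of a k x k torus."""
    def ring_dist(a, b):
        # Shortest distance on a ring of k nodes (wraparound link).
        d = abs(a - b)
        return min(d, k - d)

    nodes = [(x, y) for x in range(k) for y in range(k)]
    dists = [ring_dist(ax, bx) + ring_dist(ay, by)
             for (ax, ay) in nodes for (bx, by) in nodes]
    return sum(dists) / len(dists), max(dists)

print(torus_hops(16))  # 256 CPUs: average 8.0, worst 16 = sqrt(256)
print(torus_hops(4))   # 16 CPUs sanity check: average 2.0, worst 4
```

For 256 CPUs this confirms the numbers in the post exactly: 8 hops average one way (16 round trip) and 16 hops worst case (32 round trip), and the 16-CPU case gives the 4-hop worst case mentioned above.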

Rob