Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (88188)9/13/1999 11:25:00 AM
From: Rob Young  Read Replies (1) | Respond to of 186894
 
Tench,

<I just looked over the Microprocessor Report article on the 21364, and now it makes sense to me, especially when
I saw the words "directory-based coherence." Alpha's approach is much tougher to do than a shared processor bus
like Merced, and it requires both hardware and software support, but it'll be quite a killer in memory bandwidth.>

Absolutely... and to quote John McCalpin (author of STREAM,
THE memory bandwidth benchmark): "it's the bandwidth, stupid!"

<Memory bandwidth 40/20 GByte/sec>

<I don't know how these guys are cooking the numbers. The RDRAM ports on a single CPU total 6.4 GB/sec
(bidirectional). The interprocessor ports have 6.4 GB/sec going in, and 6.4 GB/sec going out, both unidirectional.
Adding those numbers up gives us 19.2 GB/sec. Where did those guys get the 40 GB/sec figure from?

Oh, wait a minute. If we count the RDRAM ports from the three other CPUs, then we'll have 25.6 GB/sec. Then toss in
the interprocessor connections (which aren't multiplied by four, because that bandwidth is shared) and you'll get 38.4
GB/sec. I see now.>

Driving down the road, I realized the 40/20 number
wasn't explained by what I reasoned or by what the slide stated.
It has to be aggregate, as you point out.
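Spelling out that aggregate arithmetic as a quick back-of-the-envelope calculation (the 6.4 GB/sec per-port figure is taken from the quote above; the split between local and shared bandwidth is my reading of it):

```python
# Aggregate bandwidth estimate for a 4-way 21364 system,
# using the 6.4 GB/sec per-port figures quoted above.
rdram = 6.4    # GB/sec, local RDRAM port per CPU (bidirectional)
ip_in = 6.4    # GB/sec, interprocessor link inbound (unidirectional)
ip_out = 6.4   # GB/sec, interprocessor link outbound (unidirectional)
cpus = 4

per_cpu = rdram + ip_in + ip_out   # ~19.2 GB/sec seen from one CPU's ports
local = rdram * cpus               # ~25.6 GB/sec across all four RDRAM ports
interconnect = ip_in + ip_out      # shared fabric, so not multiplied by four
aggregate = local + interconnect   # ~38.4 GB/sec -- close to the slide's "40"

print(round(per_cpu, 1), round(local, 1), round(aggregate, 1))
```

Which is exactly the 19.2 / 25.6 / 38.4 progression in the quote: the "40" only makes sense as a system-wide aggregate, not as anything one CPU ever sees.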

<Memory latency 60-100 ns ... Of course the latency is the big thing. That IS an outstanding number for a 4-processor
config. I went and looked at others and, from my old and foggy memory, I believe the best I
found for current day is 270 ns with 4 processors.>

< Latency for remote memory accesses will be much longer, but of course they will be kept to a minimum in a
NUMA system. As for your "current day" latency figure, that sounds like a good estimation for a 4-way shared bus
with normal TPC-C traffic.>

I think (am pretty sure) that the 270 ns is the best in the industry,
and it includes a 4-processor Sun box which is point-to-point.
Your comment about "remote" being much longer... I don't
see what you mean. There are only 4 processors in a 4-processor
box, and a remote memory access (i.e., to the memory
hanging off another processor) is 100 ns. If you look
at the 21364, it acts as a network router; are you overlooking that?

Or are you saying: what about a 32-processor machine that
is in essence 8 of these lashed together via the
Global Switch (Wildfire)? I've read that the remote latency
is 3-5 times worse, that 3-5 times being 3*60 to 5*60 ns.
I'll bet on 4 times, which gives 240 ns remote latency.
However, the Global Switch is *only* good for 6 GB/sec aggregate
point to point. I would agree that they have eliminated
NUMA-ness, because as they point out, "NUMA sucks."
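Putting numbers on that guess (the 60 ns local figure comes from the slide; the 3-5x multiplier is the range I read about, so treat it as an assumption):

```python
# Rough remote-latency range for a Wildfire-style system built
# from 21364 4-way blocks, per the 3-5x guess above.
local_ns = 60                                  # best-case local latency from the slide
remote_ns = [m * local_ns for m in (3, 4, 5)]  # 3x to 5x worse when going remote
best_bet = 4 * local_ns                        # the 4x guess

print(remote_ns, best_bet)
```

The 4x bet lands at 240 ns, which would put remote accesses on a big Wildfire box right around today's ~270 ns best-in-industry local figure.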

The question now is just how good that remote latency is and
how it looks as they scale to 128 and 256 CPUs ...

Rob