Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (109145)9/1/2000 12:07:10 AM
From: Rob Young
 
"anti-Itanium FUD at beginning"

Don't you remember that? Shortly after the MPF presentation,
the Intel/HP fellows came back and showed how they
could solve the 8 Queens problem in fewer instructions, whereupon
the Alpha folks showed they could do it in even fewer!
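For reference, the 8 Queens problem behind that instruction-count bragging match is just a small backtracking search. A minimal sketch in Python (my own illustration, nothing to do with either camp's hand-tuned assembly):

```python
def n_queens_solutions(n=8):
    """Count placements of n queens on an n x n board with no two attacking."""
    count = 0
    cols, diag1, diag2 = set(), set(), set()  # occupied columns and diagonals

    def place(row):
        nonlocal count
        if row == n:          # every row has a queen: one full solution
            count += 1
            return
        for col in range(n):
            # A square is safe if its column and both diagonals are free.
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            cols.add(col); diag1.add(row - col); diag2.add(row + col)
            place(row + 1)
            cols.remove(col); diag1.remove(row - col); diag2.remove(row + col)

    place(0)
    return count

print(n_queens_solutions(8))  # the classic board has 92 solutions
```

The interesting part at MPF wasn't the algorithm, of course, but how few instructions each ISA needed for the inner test-and-branch loop.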

Besides, you must appreciate that these aren't garage-shop
wanna-bes or Internet wanna-bes; these are serious
engineers. So when Pete Bannon (who is one of Compaq's
senior consultants .. an honor, actually) trots out the
difficulties with Itanium (obvious run-time problems, i.e.
Java, and function calls, etc.), you can bet he knows
what he is talking about, and it doesn't fall into
"FUD". He is showing where Itanium is weakest.

<Why 16 as a choice? Most hops are 2. When you add up the
latency for those 2 hops you are still doing much
better than an L3 hit and of course main memory.>

You probably meant "L2 hit" instead of "L3 hit." But that's not the point.

---
Actually, that is what I meant. If a remote L2 that is
2 hops away has *less* latency than a *possible* L3,
then it could be said the system has a better, more
effective 24 MByte L2. Whereas another Alpha (or another
architecture) would have L1, L2 and L3, each taking some
number of cycles to access. The L3 would be 8 MByte, and
odds are that with larger-footprint workloads you miss L3
quite often .. time to go to main memory. HOWEVER, 24 MByte
is more than enough of a sweet spot .. (see the earlier
referenced paper for some detail) that it is a much larger
win .. bag L3 all the way around!
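The "bag L3" argument is just a weighted-average-latency argument, and it can be written out. Every hit rate and latency below is a made-up illustrative number for the sketch, not a 21364 or Wildfire spec:

```python
def avg_latency(levels):
    """Average access latency for a cache hierarchy.

    levels: list of (hit_rate, latency_ns) from fastest to slowest,
    where each hit rate is conditional on missing all earlier levels.
    The last level (main memory) must have hit rate 1.0.
    """
    total, reach = 0.0, 1.0
    for hit, lat in levels:
        total += reach * hit * lat   # fraction of accesses served here
        reach *= 1.0 - hit           # fraction that falls through
    return total

# Hypothetical machine A: L1, L2, then a 2-hop remote L2 acting as a big L3.
# Hypothetical machine B: L1, L2, then a conventional on-board L3 that,
# being smaller, misses more often to main memory.  All numbers invented.
a = avg_latency([(0.9, 2), (0.8, 12), (0.7, 45), (1.0, 170)])
b = avg_latency([(0.9, 2), (0.6, 12), (0.5, 30), (1.0, 170)])
print(a, b)  # with these assumptions, A comes out ahead despite the hops
```

The point the numbers are meant to illustrate: a slightly slower but much larger "effective L2" can beat a faster L3 that falls through to main memory more often.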

"latencies incurred with heavy P2P traffic, which is highly variable but must still be
considered."

No .. they use a "smart" router to ensure a packet takes
an optimal path. Note the 4 channels. Can't the system
be overloaded? I didn't say that. But I did see a foil
elsewhere that said something about 100 GByte/sec
main memory bandwidth .. aggregate.
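The earlier "most hops are 2" claim is easy to sanity-check, since the 21364's interconnect is a 2-D torus. A quick sketch for an assumed 16-CPU (4x4) configuration:

```python
from collections import Counter

def hop_distance(a, b, dim):
    """Minimal hops between coordinates a and b along one torus
    dimension of size dim (wraparound links allowed)."""
    d = abs(a - b)
    return min(d, dim - d)

def hop_histogram(nx, ny):
    """How many nodes sit at each minimal hop count from node (0,0)
    in an nx x ny 2-D torus."""
    hist = Counter()
    for x in range(nx):
        for y in range(ny):
            hist[hop_distance(0, x, nx) + hop_distance(0, y, ny)] += 1
    return dict(hist)

print(hop_histogram(4, 4))  # {0: 1, 1: 4, 2: 6, 3: 4, 4: 1}
```

For a 4x4 torus, 2 hops is indeed the most common distance (6 of the 15 remote nodes), and no node is more than 4 hops away.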

Regarding the latencies ... the 15 ns is "load to use".
Maybe I'm misreading that, but it includes getting the data
from the remote L2 and loading it into the local L1. That's my
read... Besides, if you think about it a bit... they
are sandbagging there too. After all, the P2P CPU-to-CPU
links run at CPU speeds B^).

Software optimized to take advantage of it? No .. not at
all. Remember, the goal here is to run programs unmodified.
You misspoke, I believe .. the OSes have to be modified.
That's a given, and they have been, and they are running on
NUMA today. Best-case Wildfire local memory access today is
330 ns; remote is 960 ns. With the on-chip memory controller,
RDRAM local memory access is on the order of 70 ns, I believe.
Much better. To hit remote memory 2 hops away
you get 70 ns + 15 + 15, maybe some factor in there too,
but what, 130-150 ns? I think that might be right, but
they don't like to talk about exact numbers this early.
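The back-of-the-envelope above (70 ns local plus 15 ns per hop) can be written out; all figures are the rough numbers from this post, not measured data, and the unspecified extra factor that would push 2 hops toward 130-150 ns is left out:

```python
def remote_latency_ns(hops, local_ns=70, per_hop_ns=15):
    """Rough remote memory latency: local DRAM access plus per-hop
    link latency.  Both defaults are this post's guesses, and any
    coherence-protocol overhead on top is deliberately ignored."""
    return local_ns + hops * per_hop_ns

print(remote_latency_ns(0))  # 70  ns, local access
print(remote_latency_ns(2))  # 100 ns, 2 hops before the fudge factor
```

Even with a generous fudge factor added, that still compares well against Wildfire's 960 ns remote access.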

But where it is much better, and where we are talking past
each other in a sense, is the 2-hop remote L2 hit.
Traditionally, you go over a switch. With the 21364 you
don't, and unless I'm mistaken a 2-hop remote L2 hit
could be 45 ns (fudge factor there). You won't
get that kind of latency in a traditional arch. Since
OLTP is very latency sensitive, the 21364 will shine
in this space.

Power4 has a good thing going there too but what about 16? 32? 64 CPUs?

Can anyone afford 64 Power4 CPUs?

Re-reading .. don't get hung up on the UMA versus NUMA
thing. All Tru64 and VMS software runs unmodified
on the current Wildfire "NUMA". Most Wildfires
with Tru64 run as an SSI, one big flat memory. The thing
I think you are getting hung up on is that some memory
accesses are faster than others. No big deal (but a VERY
big deal at the OS level).

Look at the "slow" 21364
memory access. Much better than the "fastest" Wildfire
(aka GS320) memory access.

Rob