Technology Stocks : MSFT Internet Explorer vs. NSCP Navigator


To: Charles Hughes who wrote (14245) 11/18/1997 5:03:00 PM
From: Justin Banks
 
Chaz -

Well, this "instruction bundle" thing is certainly going to cause code bloat -- 128 bits for 3 instructions? Wow. That's about 1.5 to 1.8 times bigger code to do the same thing as compared to x86, I bet. Do the math on the registers -- there's 2K worth of register data in there -- I wonder what that's going to do to context switch times. I'm sure they're doing some tricks to solve the problem -- it would be interesting to hear what they are.
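
For anyone who wants to check the bundle arithmetic, here is a back-of-the-envelope sketch in Python. The 32-bit fixed-width RISC comparison is my own added reference point, and the function names are made up for illustration. The x86 figure is only what the 1.5x to 1.8x guess would imply if both sides needed the same number of instructions, which they probably don't, so treat it as a rough sanity check and not a measurement.

```python
BUNDLE_BITS = 128        # from the slides: one bundle is 128 bits
INSNS_PER_BUNDLE = 3     # from the slides: three instructions per bundle


def bits_per_insn(bundle_bits=BUNDLE_BITS, insns=INSNS_PER_BUNDLE):
    """Average encoding cost per instruction, template bits included."""
    return bundle_bits / insns


def implied_x86_bytes(ratio):
    """Average x86 instruction length (bytes) that would make bundled
    code `ratio` times bigger, ASSUMING equal instruction counts."""
    return bits_per_insn() / 8 / ratio


if __name__ == "__main__":
    per_insn = bits_per_insn()
    print(f"bits per instruction: {per_insn:.2f}")
    # Assumed comparison point: a classic 32-bit fixed-width RISC encoding.
    print(f"vs 32-bit RISC:       {per_insn / 32:.2f}x")
    for ratio in (1.5, 1.8):
        print(f"{ratio}x bloat implies x86 average of "
              f"{implied_x86_bytes(ratio):.2f} bytes/insn")
```

So the guess above amounts to saying typical x86 code averages roughly three to three and a half bytes per instruction.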

Another interesting question is how they're going to emulate x86 instructions without building x86 problems into Merced (gate delay, especially).

The compiler problems will be quite difficult as well. I imagine that was a pretty big part of the reason INTC made their deal with DEC, as the Alpha compilers do some pretty amazingly aggressive optimization.

Their slides say:

1. "Flexibly groups any number of independent instructions"
and
2. "Simplifies hardware by removing dynamic mechanisms".
and
3. "Fully-interlocked hardware provides compatibility"

In a classic VLIW, or even an LIW (like i860 running dual piped),
the instructions within a bundle are independent, like horizontal
microcode, which means that each field within each long word can be
handled independently. Merced isn't that.

It may be that each bundle can only have independent instructions[1],
or it may be that the template tells the hardware which are independent, but everything is interlocked anyway[3], or it may be that each bundle has independent instructions[1], but inter-bundle dependencies are interlocked[3]. Note that [2] didn't say "eliminated".

In any case, there are clearly a bunch of dynamic interlocks.

EPIC gets 3 instructions per 128-bit bundle; a useful thought question is:

a) Suppose a compiler knows that we fetch aligned 128-bit 4-instruction sequences, and does its best to order instructions for this [some compilers do this, especially for Alpha 21164].

b) Suppose on I-cache miss, you do some checking of the incoming 4
instructions, expand them, add a "template", and keep more bits in
the cache line, i.e., use a decoded-instruction-cache (R10K does a little of this). You now have something that looks like an EPIC bundle, with dependencies marked. Hence, at the cost of (maybe) 1 more cycle of L1 cache miss, you get something like EPIC, although they'd have more explicit registers, and you'd still have to do renaming.
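
To make (b) concrete, here is a toy sketch in Python of what that predecode step might compute while filling the I-cache line. Everything in it is hypothetical: the `Insn` tuple, the register names, and the idea that the "template" is just one flag per instruction are my simplifications, not anything from the slides or from R10K. It marks only true (read-after-write) dependencies inside the 4-instruction group and leaves write-after-write and write-after-read hazards to renaming, as noted above.

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Hypothetical decoded instruction: opcode, one dest, some sources."""
    op: str
    dest: Optional[str]          # None for instructions that write no register
    srcs: Tuple[str, ...]


def predecode(group: List[Insn]) -> List[bool]:
    """Return one flag per instruction in the fetch group.

    flags[i] is True when instruction i reads a register written by an
    earlier instruction in the same group (a RAW dependency), so it
    cannot issue in parallel with that producer.
    """
    written = set()              # registers written so far in this group
    flags = []
    for insn in group:
        flags.append(any(src in written for src in insn.srcs))
        if insn.dest is not None:
            written.add(insn.dest)
    return flags


if __name__ == "__main__":
    group = [
        Insn("add", "r1", ("r2", "r3")),
        Insn("sub", "r4", ("r1", "r5")),   # reads r1 from the add
        Insn("ld",  "r6", ("r7",)),
        Insn("st",  None, ("r6", "r8")),   # reads r6 from the ld
    ]
    print(predecode(group))
```

The flags would be stored alongside the expanded instructions in the cache line, so the issue logic reads them instead of recomputing the comparisons on every fetch.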

We've been told that we're seeing the Mona Lisa, but we've only been shown 3 pixels, so it's hard to tell.

Other killer stuff includes:
(a) Bandwidths thru the memory hierarchy.
(b) Latencies, especially of dependent pointer-following.
Nothing they showed so far helps this very much; we all have the
same miserable problem. Likewise, all this object-oriented
code, with function calls loaded from pointers, is miserable
for everybody.
(c) Necessary interlocks for loads & stores to same addresses, or ones that might be the same.
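
On (c), the reason those interlocks are "necessary" is that the check has to be conservative. A minimal sketch in Python, assuming equal-sized accesses and using `None` to stand for an address that hasn't been computed yet (the function name and the 8-byte default are mine, purely for illustration):

```python
from typing import Optional


def may_alias(store_addr: Optional[int],
              load_addr: Optional[int],
              size: int = 8) -> bool:
    """Conservative check: must a load wait behind an earlier store?

    Assumes both accesses are `size` bytes wide. An unresolved address
    (None) forces the pessimistic answer, since it might be the same.
    """
    if store_addr is None or load_addr is None:
        return True
    # Byte ranges [a, a+size) and [b, b+size) overlap iff |a - b| < size.
    return abs(store_addr - load_addr) < size


if __name__ == "__main__":
    print(may_alias(0x1000, 0x1004))   # overlapping 8-byte accesses
    print(may_alias(0x1000, 0x1008))   # adjacent, no overlap
    print(may_alias(None, 0x1000))     # store address not known yet
```

The "might be the same" case is the expensive one: until the store address resolves, every later load has to be treated as a possible conflict.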

It is rumored that load/store instructions do not calculate
addresses, somewhat akin to AMD 29K; if that's so, then they added a lot of instructions.

The HP architects are good, and don't usually do stupid things. On the other hand, I've heard some of them muttering evil words about the difficulty of getting Intel to do anything sensible :-)

-justinb