To: Charles Hughes who wrote (14245)
From: Justin Banks
11/18/1997 5:03:00 PM
Chaz -

Well, this "instruction bundle" thing is certainly going to cause code bloat -- 128 bits for 3 instructions? Wow. About 1.5 to 1.8 times bigger code to do the same thing, as compared to x86, I bet. Do the math on the registers -- there's 2K worth of register data in there -- I wonder what that's going to do to context switch times. I'm sure they're doing some tricks to solve the problem -- it would be interesting to hear what they are.

Another interesting question is how they're going to emulate x86 instructions without building x86 problems into Merced (gate delay, especially).

The compiler problems will be quite difficult as well. I imagine that was a pretty big part of the reason INTC made their deal with DEC, as the Alpha compilers do some pretty amazingly aggressive optimization.

Their slides say:
1. "Flexibly groups any number of independent instructions"
2. "Simplifies hardware by removing dynamic mechanisms"
3. "Fully-interlocked hardware provides compatibility"

In a classic VLIW, or even an LIW (like i860 running dual piped), the instructions within a bundle are independent, like horizontal microcode, which means that each field within each long word can be handled independently. Merced isn't that. It may be that each bundle can only have independent instructions[1], or it may be that the template tells the hardware which are independent, but everything is interlocked anyway[3], or it may be that each bundle has independent instructions[1], but inter-bundle dependencies are interlocked[3]. Note that [2] didn't say "eliminated". In any case, there are clearly a bunch of dynamic interlocks.

EPIC gets 3 instructions per 128-bit bundle; a useful thought question is:

a) Suppose a compiler knows that we fetch aligned 128-bit 4-instruction sequences, and does its best to order instructions for this [some compilers do this, especially for Alpha 21164].
b) Suppose on I-cache miss, you do some checking of the incoming 4 instructions, expand them, add a "template", and keep more bits in the cache line, i.e., use a decoded-instruction cache (R10K does a little of this). You now have something that looks like an EPIC bundle, with dependencies marked.

Hence, at the cost of (maybe) 1 more cycle of L1 cache miss, you get something like EPIC, although they'd have more explicit registers, and you'd still have to do renaming.

We've been told that we're seeing the Mona Lisa, but we've only been shown 3 pixels, so it's hard to tell.

Other killer stuff includes:

(a) Bandwidths thru the memory hierarchy.

(b) Latencies, especially of dependent pointer-following. Nothing they've shown so far helps this very much; we all have the same miserable problem. Likewise, all this object-oriented code, with function calls loaded from pointers, is miserable for everybody.

(c) Necessary interlocks for loads & stores to the same addresses, or to ones that might be the same. It is rumored that load/store instructions do not calculate addresses, somewhat akin to AMD 29K; if that's so, then they added a lot of instructions.

The HP architects are good, and don't usually do stupid things. On the other hand, I've heard some of them muttering evil words about the difficulty of getting Intel to do anything sensible :-)

-justinb
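
For anyone who wants to check the bundle-size and register arithmetic above, here is a rough sketch in Python. The 32-bit fixed-length baseline and the register counts and widths (128 integer plus 128 floating-point registers, treated as 64 bits each) are assumptions picked for illustration, not figures taken from the slides. The function names are made up for this example.

```python
# Back-of-the-envelope numbers for a 128-bit, 3-instruction bundle.
# All defaults below are illustrative assumptions, not vendor data.

BUNDLE_BITS = 128
INSTRUCTIONS_PER_BUNDLE = 3


def bits_per_instruction(bundle_bits=BUNDLE_BITS, slots=INSTRUCTIONS_PER_BUNDLE):
    """Average encoding cost of one instruction, template bits included."""
    return bundle_bits / slots


def size_ratio(baseline_bits_per_instruction):
    """Code-size ratio against a baseline encoding, assuming the same
    number of instructions is needed on both machines."""
    return bits_per_instruction() / baseline_bits_per_instruction


def register_state_bytes(int_regs=128, fp_regs=128, width_bits=64):
    """Bytes of register state to save on a full context switch
    (assumed register counts and widths)."""
    return (int_regs + fp_regs) * width_bits // 8


if __name__ == "__main__":
    print(f"{bits_per_instruction():.1f} bits per instruction")
    print(f"{size_ratio(32):.2f}x a 32-bit fixed-length encoding")
    print(f"{register_state_bytes()} bytes of register state")
```

Under these assumptions a bundle costs about 42.7 bits per instruction, which is 1.33 times a 32-bit RISC encoding, and the register state comes to 2048 bytes. A 1.5 to 1.8 times ratio against x86 would correspond to an average x86 instruction of roughly 3.0 to 3.6 bytes, if the instruction counts were equal, which they probably aren't.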
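
The thought experiment in (b), marking dependencies as a group of 4 instructions comes in on an I-cache miss, can be sketched roughly as below. This is only an illustration of the idea, not how any real decoded-instruction cache works: the `Instr` record, the register names, and the one-bit-per-instruction "template" format are all invented here, and only true (read-after-write) dependences are marked, on the assumption that renaming deals with the rest.

```python
from typing import FrozenSet, List, NamedTuple


class Instr(NamedTuple):
    """Hypothetical predecoded instruction: text plus register sets."""
    text: str
    dest: FrozenSet[str]  # registers written
    srcs: FrozenSet[str]  # registers read


def instr(text, dest=(), srcs=()):
    """Convenience constructor for the hypothetical Instr record."""
    return Instr(text, frozenset(dest), frozenset(srcs))


def dependency_template(group: List[Instr]) -> int:
    """Return a bitmask with bit i set when instruction i reads a
    register written by an earlier instruction in the same group."""
    template = 0
    written = set()
    for i, ins in enumerate(group):
        if ins.srcs & written:  # read-after-write inside the group
            template |= 1 << i
        written |= ins.dest
    return template


if __name__ == "__main__":
    group = [
        instr("add r1 = r2, r3", dest=["r1"], srcs=["r2", "r3"]),
        instr("sub r4 = r1, r5", dest=["r4"], srcs=["r1", "r5"]),
        instr("ld  r6 = [r7]", dest=["r6"], srcs=["r7"]),
        instr("or  r8 = r6, r2", dest=["r8"], srcs=["r6", "r2"]),
    ]
    # Bits 1 and 3 are set: the sub needs r1, the or needs r6.
    print(f"template = {dependency_template(group):04b}")
```

The extra bits would be stored alongside the expanded instructions in the cache line, so the issue logic could read them instead of comparing register fields again on every fetch.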