Chip architecture -- which do you like? VLIW, which stands for "very LONG instruction word," or RISC, "reduced instruction set computer"? The engineers at CUBE use microSparc, a RISC architecture. . . . eetimes.com
Analysis: VLIW chips will depend on smarter software
By Alexander Wolfe
EE Times (05/15/99, 7:10 p.m. EDT)
DALLAS — As architects begin to field devices — DSPs, multimedia chips and, soon, general-purpose microprocessors such as Intel's Merced — built around very-long-instruction-word (VLIW) architectures, one big question looms: How will these chips perform when battle-tested under real-world conditions?
That's no theoretical question, because it is software that will decompose application programs into the streams of parallel instructions required to feed the numerous on-chip execution units in a VLIW device. But if that software is inefficient, the hardware will not beat out today's superscalar architectures, experts agree.
"VLIW has a different kind of complexity than RISC," said Ray Simar, a Texas Instruments Inc. Fellow, based here, and head of the company's DSP architectures. "With VLIW, you've got a wealth of riches because of all the on-chip functional units, but you have to make sure you're getting the best use out of all of them."
"Everyone's trying to get a better compiler to hook better into the hardware," said Wen-mei Hwu, chairman of the computer-engineering program at the University of Illinois at Urbana-Champaign. "Each company has its own course and constraints."
Nowhere is this more evident than in the DSP arena. There, TI is moving to VLIW with its C6X family and the team of Motorola Inc. and Lucent Technologies is jointly fielding the StarCore VLIW DSP architecture. (Analog Devices Inc. also notes that its TigerSharc executes multiple instructions per cycle, like a VLIW machine.)
Perhaps the highest profile for the technology will come from Intel, whose upcoming Merced will bring VLIW to the center stage of general-purpose computing. According to the company, Merced is not a VLIW architecture. It is an EPIC — explicitly parallel instruction computing — design. However, it does incorporate VLIW concepts, Intel admits.
To pave a path to Intel's new architecture, Merced-capable compilers are in the works at Hewlett-Packard, Microsoft, Sun Microsystems, Silicon Graphics, Cygnus Solutions, Metaware and Edinburgh Portable Compilers Ltd. (Scotland). Intel will rely on those tools as well as its own homegrown compiler technology, which it is prepared to license to other software vendors.
Today, Intel's Merced compiler "is within 10 percent of delivering our performance projections on Merced," said Ron Curry, IA-64 marketing director at Intel. "The process has been to write the compiler and simulate it on our Merced logic model, and then optimize it. We don't have Merced hardware yet; until we have that hardware, we can't do the final 10 percent of tuning that will be required."
Industry sources report that Intel is actively showing its Merced code generator and related tools throughout the marketplace. Merced samples themselves should be available later this year and production is slated for mid-2000, Intel officials said.
DSP difference
One interesting rub, as VLIW compiler technology comes together, is that the issues on the DSP side are rather different from the challenges facing Merced.
"What was attractive about VLIW was that it fit really well with a lot of the 'care-abouts' we had for DSP," said TI's Simar. "One is, we didn't want to burn a lot of transistors for control logic, because we wanted to get more performance per square millimeter, and control logic doesn't really help that."
Simar sees traditional superscalar design as ideal for general-purpose computing, noting that DSP applications tend to be better-constrained. "DSP actually does have a decent amount of regularity to it, and lends itself to an a priori [i.e., before run-time] decision about what should be done," he said. "Rather than having to put that complexity into the silicon — and the silicon really can't look that far ahead anyway — moving it into the compiler, what you have is a real nice fit between the way DSP algorithms tend to look and the compiler being able to extract parallelism out of it."
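To make Simar's point concrete, consider the inner loop of a FIR filter, sketched below in C (an illustrative kernel, not TI code). Every data dependence is visible at compile time, so a VLIW compiler can make those a priori decisions with no out-of-order hardware needed to discover the parallelism at run-time:

```c
/* Illustrative FIR-style kernel (not TI code). The loads and
 * multiplies in different iterations are independent and fully
 * visible at compile time, so a VLIW compiler can decide a priori
 * how to unroll the loop and assign operations to functional units. */
float dot_product(const float *x, const float *h, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i] * h[i];
    return sum;
}
```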
On the microprocessor side, architects have already dipped their toes into VLIW concepts. For example, recent PA-RISC, UltraSparc, Pentium II and MIPS R10000 processors provide some support for the VLIW concept of predicated execution (removing unnecessary branches).
However, Merced will take a quantum leap forward and pose commensurate challenges. "Code size is going to be a big issue," said the University of Illinois' Hwu. "If you look at the typical DSP application, there are in general probably only 10,000 or 20,000 lines of code. If you look at the kind of things Intel is going to have to deal with when they get Merced out to the market, these will include transaction processing running on NT-level server applications. Such apps are already big in terms of code size."
But Hwu noted that researchers across the board are taking a hard look at ways to cut code size in large applications. "In fact, there are optimizations you can do to reduce the code in the EPIC world," he said.
That's something academic researchers have been studying and writing papers on. Indeed, Intel and HP have been buttressing their IA-64 work by hiring new doctoral graduates from a field so small that the DSP makers are drawing upon the same pool of talent. There's lots of informal and often friendly interaction among the communities at industry conferences.
For their part, Intel officials strongly believe code bloat won't be a problem with Merced. "I'd say it's just the opposite," said Intel's Curry. "Large applications are easier to optimize because you have a larger chunk of code to work with."
Curry admits there will be some expansion in application size just by virtue of the new architecture. However, he noted that Merced compilers will be equipped with the intelligence not only to manage code size, but also to examine different blocks of code within an application and decide which of those blocks should be optimized for size and which for performance.
Rich Fuhler, vice president of OEM at software-tools maker Metaware, pointed out that DSPs and general-purpose CPUs play in divergent arenas. He noted that DSPs typically rely on five or six heavy-duty algorithms, such as fast Fourier transforms (FFTs) and their cousins. As a result, there's ample opportunity for vendors to optimize such code and provide it to customers on demand. In contrast, he noted that "a compiler for a general-purpose [VLIW] processor has a great opportunity to use every available technique to improve code performance."
Even today, compilers are often highly underrated in terms of the contribution they can make to the performance of any microprocessor. "If you look at something like a Pentium II, when it runs integer code, you are talking about issuing up to three X86 instructions per cycle. [This equates to issuing] up to four [micro-ops] into the reservation station and retiring up to four per cycle," the University of Illinois' Hwu explained. "However, when everything is said and done, you may be talking about as few as one X86 instruction per cycle after you've discounted everything.
"As it turns out," Hwu continued, "even though all the out-of-order execution machines are supposed to be designed to be less dependent on the compiler, I have not seen any out-of-order machine that doesn't benefit quite dramatically from compiler optimizations, especially in scheduling." In the superscalar world, this entails having the compiler schedule ops for the issue units and/or for the retire-ment stage.
In contrast, compilers for IA-64 will have to grapple with its larger complement of execution units. Hence, optimization techniques will assume overarching importance.
"What Intel and HP are probably talking about [for IA-64] — and I can't represent their opinion — is to say, 'Well, we're providing the compiler with the tools to do the optimizations that weren't done for X86,' " Hwu said. "For example, there's predication support. Also, if the compiler can do the appropriate optimizations, there are two major benefits. First, the code is intrinsically more efficient. And secondly, the code is better scheduled and more parallel . . . to the hardware.
"In other words," Hwu continued, "you can get the code to flow through at two or three instructions per cycle." That would constitute quite an achievement, if accomplished on a regular rather than intermittent basis.
Indeed, in lab tests Hwu conducted on his ground-breaking Impact compiler (a software test bed for VLIW), he found that it was possible to double the amount of parallelism extracted from an application.
Good code, bad code
Meanwhile, TI is seeing early examples of VLIW code from its customers, which indicates that developers will be plowing new territory as they move up the learning curve. "You see a tremendous variety in what people write — there's good code and bad code," Simar said. "We're doing a lot more analysis of large bodies of code. We're looking at algorithms to find out where parallelism is possible."
He noted that functions such as FFTs and finite-impulse-response filters do well. Surprisingly, Simar said, infinite-impulse-response filters don't do as well because of the way the delay samples move through the inner loops.
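The contrast Simar draws is visible in the loop structure itself, as the hedged C sketch below shows (a first-order IIR is used for brevity; this is illustrative code, not TI's):

```c
/* FIR: each output y[i] depends only on the inputs x[], so the
 * multiply-accumulates for different outputs are independent and can
 * fill a VLIW machine's functional units. */
void fir(const float *x, const float *h, float *y, int n, int taps)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int k = 0; k < taps; k++)
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
}

/* IIR (first order, for brevity): y[i] depends on y[i-1]. The delay
 * sample feeding back through the loop is a loop-carried dependence
 * that serializes the computation, limiting extractable parallelism. */
void iir_first_order(const float *x, float *y, int n, float a, float b)
{
    float prev = 0.0f;
    for (int i = 0; i < n; i++) {
        prev = b * x[i] + a * prev;
        y[i] = prev;
    }
}
```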
As VLIW users acquire such insights, they may have to jettison some traditional techniques in favor of new algorithms. "There will be this natural selection that will start to happen," Simar said. "That's going to be really interesting and part of our challenge." |