Microprocessor Forum: DSP architectures vie for telecom slots
By Stephan Ohr EE Times (10/14/98, 2:02 p.m. EDT)
SAN JOSE, Calif. — Two new DSP architectures show entirely different approaches to the problems of power and programmability.
The StarCore 400 architecture, jointly developed by Motorola Inc.'s Semiconductor Products Sector (Austin, Texas) and Lucent Technologies (Murray Hill, N.J.), is a compiler-driven design intended to extract maximum performance from programs written in C code. It features a scalable computational model and instruction set.
The new TigerSharc superscalar architecture from Analog Devices Inc., which claims high performance for programs generated in C or other high-level languages, is built on the rapid-fire execution of short instructions fed over buses with an aggregate bandwidth of 12 Gbytes/second. TigerSharc promises to execute up to 2 billion multiply-accumulate (MAC) operations per second.
Both architectures, unveiled at the 11th annual Microprocessor Forum, will compete with Texas Instruments Inc.'s C6X devices for design wins in telecommunications line cards and cellular basestations.
The StarCore architecture will compete head-to-head with TI's C6X architecture in offering the parallelism that telecom switching systems seem to want. It also pursues the C6X goal of eliminating assembly language programming. Rather than using a very long instruction word (VLIW) approach, StarCore uses variable length execution sets (VLES), which can expand or shrink with each hardware implementation of StarCore. "This is 'post-VLIW,' " said Kevin Kloker, architecture director of StarCore and deputy director of Motorola's StarCore design center. "This is VLIW done right."
Indeed, the StarCore 400 qualifies as one of the industry's first "compiler-driven" DSP designs, in which the C-code compiler and the hardware are precisely tuned to each other. The variable length execution sets (VLES) are actually groupings of basic instructions intended for specific execution units. In the operation of the compiler, the C code is scanned with reference to a specific StarCore implementation, and basic instructions are grouped together and scheduled according to the "discovered parallelism."
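The grouping step can be pictured with a toy model. The Python sketch below is an illustration only, not the StarCore compiler's actual algorithm: the `Instr` record, the unit names and the greedy in-order packing rule are all assumptions made for the example. It packs instructions into execution sets, starting a new set whenever the target implementation runs out of a given execution unit or an instruction depends on a result produced earlier in the same set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical basic instruction: a name, the unit type it needs,
    and the registers it reads and writes."""
    name: str
    unit: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def pack_execution_sets(instrs, units_available):
    """Greedy, in-order packing into variable-length execution sets.

    units_available maps a unit type to how many of that unit the
    (hypothetical) hardware implementation provides. Every unit type
    used by instrs is assumed to have at least one unit.
    """
    sets, current, used, written = [], [], {}, set()
    for ins in instrs:
        unit_full = used.get(ins.unit, 0) >= units_available.get(ins.unit, 0)
        # Depends on (or overwrites) a result produced in the current set.
        conflict = (ins.reads | ins.writes) & written
        if current and (unit_full or conflict):
            sets.append(current)
            current, used, written = [], {}, set()
        current.append(ins)
        used[ins.unit] = used.get(ins.unit, 0) + 1
        written |= ins.writes
    if current:
        sets.append(current)
    return sets


# Four independent MACs followed by an add that needs the last MAC's result.
program = [
    Instr(f"mac a{i}", "mac", frozenset({f"x{i}", f"y{i}"}), frozenset({f"a{i}"}))
    for i in range(4)
]
program.append(Instr("add s", "alu", frozenset({"a3"}), frozenset({"s"})))

two_mac = pack_execution_sets(program, {"mac": 2, "alu": 1})
four_mac = pack_execution_sets(program, {"mac": 4, "alu": 1})

for label, sets in (("2-MAC", two_mac), ("4-MAC", four_mac)):
    print(label, [[ins.name for ins in s] for s in sets])
```

In this toy model the same instruction stream packs into three execution sets on a two-MAC target and two sets on a four-MAC target, which is the sense in which the sets "expand or shrink" with the implementation.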
Instruction parallelism is scaled by the compiler. "The object is to achieve multiple things with each clock cycle," Kloker said, "but you don't want 'no ops' or alignment issues." The StarCore uses a basic orthogonal 16-bit instruction — it's "compiler friendly," Kloker said — and will avoid alignment issues in the instruction pipeline with "prefix." This results in a compiled code density that is comparable to the best embedded RISC processor, Kloker said.
Since StarCore is intended to be a scalable architecture, actual hardware implementations can vary in the amount of parallelism they embody. Some implementations can have two MACs, like Lucent's DSP16000, for example, while others may have more.
TI's C6X has eight parallel execution units; its compiler is always looking for instructions to parallelize. "The problem with this approach is that it's useful for only those applications which are not power-sensitive," said Kloker. "It is not designed for scalability." With its compiler-driven approach, the StarCore 400 "is one of the first to do a practical job of scalability," he said.
In all cases, the computational resources will include data ALUs and registers, address ALUs and address registers, and instruction registers and instruction set accelerators. Since everything running through the machine stems from memory accesses, data bandwidth and instruction bandwidth will be the most important factors governing performance. The video and multimedia capabilities required by MPEG-4 and third-generation wireless phones will demand billions of MAC operations per second, the StarCore team acknowledged.
The StarCore 440, a version of the architecture due in the second half of 1999, is predicted to deliver 1.2 billion DSP MACs per second, with 4 MACs per tick of a 300-MHz clock. With a 128-bit VLES instruction grouping (and two instructions used for MACs) the machine will actually execute 6 instructions per clock, or 3,000 RISC Mips. The data word is 16 bits wide, with a 32-bit address word and 40-bit accumulators. The machine takes in 8 data words per clock, or 4.8 Gbytes per second. Implemented in 0.13-micron CMOS, the device is expected to consume less than 0.1 mA per DSP MAC at 1.5 V.
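The headline StarCore 440 numbers follow from simple arithmetic on the quoted clock rate. The short Python check below is only a back-of-the-envelope sketch; the constant and function names are made up for the example, and the figures are peak rates taken from the text, not measurements.

```python
# Figures quoted for the StarCore 440.
CLOCK_HZ = 300_000_000      # 300-MHz clock
MACS_PER_CLOCK = 4          # 4 MACs per tick
WORDS_PER_CLOCK = 8         # 8 data words taken in per clock
DATA_WORD_BYTES = 2         # 16-bit data word
QUOTED_RISC_MIPS = 3_000    # quoted RISC-equivalent rating


def peak_macs_per_second(clock_hz, macs_per_clock):
    """Peak MAC rate: one full set of MACs issued on every clock."""
    return clock_hz * macs_per_clock


def data_bandwidth_bytes_per_second(clock_hz, words_per_clock, word_bytes):
    """Peak data intake in bytes per second."""
    return clock_hz * words_per_clock * word_bytes


macs = peak_macs_per_second(CLOCK_HZ, MACS_PER_CLOCK)
bandwidth = data_bandwidth_bytes_per_second(
    CLOCK_HZ, WORDS_PER_CLOCK, DATA_WORD_BYTES
)
# RISC-equivalent operations per clock implied by the quoted Mips rating.
risc_ops_per_clock = QUOTED_RISC_MIPS * 1_000_000 // CLOCK_HZ

print(f"{macs / 1e9:.1f} billion MACs/s")
print(f"{bandwidth / 1e9:.1f} Gbytes/s")
print(f"{risc_ops_per_clock} RISC-equivalent operations per clock")
```

The MAC rate and data bandwidth reproduce the 1.2 billion and 4.8 Gbytes per second quoted above. The 3,000 RISC Mips rating works out to 10 RISC-equivalent operations per clock, so it evidently counts some of the 6 instructions, presumably the MACs, as more than one RISC operation each.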
Also on Wednesday (Oct. 14), Analog Devices unveiled its TigerSharc architecture, which it said will perform 2 billion 16-bit MAC operations per second — theoretically, 8 MACs per tick — with a 250-MHz clock.
ADI's TigerSharc device actually has two computational units, each capable of a 32 x 32 multiply, and each fed by a 128-bit wide data bus. Three 128-bit buses actually shuttle across this DSP (two for data, one for instructions), making for an aggregate bandwidth of 12 Gbytes/s. Each second, the machine cranks through the equivalent of two billion 16-bit MACs, or 500 million 32-bit MACs, or 8 billion 8-bit operations.
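The TigerSharc figures can be checked the same way. The Python sketch below uses only the numbers quoted in the article (250-MHz clock, three 128-bit buses, and the per-second operation counts); the names are illustrative, and the per-clock counts are derived by division rather than taken from ADI.

```python
CLOCK_HZ = 250_000_000   # 250-MHz clock
NUM_BUSES = 3            # two data buses plus one instruction bus
BUS_WIDTH_BITS = 128

# Operations per second quoted for each data width, in bits.
QUOTED_OPS_PER_SECOND = {
    8: 8_000_000_000,
    16: 2_000_000_000,
    32: 500_000_000,
}


def aggregate_bandwidth_bytes_per_second(clock_hz, num_buses, width_bits):
    """Peak bandwidth if every bus moves one full word per clock."""
    return clock_hz * num_buses * width_bits // 8


def ops_per_clock(ops_per_second, clock_hz):
    """Operations each clock tick must complete to reach the quoted rate."""
    return ops_per_second // clock_hz


bandwidth = aggregate_bandwidth_bytes_per_second(
    CLOCK_HZ, NUM_BUSES, BUS_WIDTH_BITS
)
per_clock = {
    bits: ops_per_clock(rate, CLOCK_HZ)
    for bits, rate in QUOTED_OPS_PER_SECOND.items()
}

print(f"{bandwidth / 1e9:.0f} Gbytes/s aggregate")
for bits, count in sorted(per_clock.items()):
    print(f"{bits:>2}-bit data: {count} operations per clock")
```

The quoted rates correspond to 32, 8 and 2 operations per clock for 8-, 16- and 32-bit data respectively, consistent with the 8 MACs per tick cited for 16-bit data.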
"That's the beauty of this architecture," said Gerry McGuire, 32-bit DSP product line manager for ADI (Norwood, Mass.). The TigerSharc is impervious to data types, he said. It will accept 8-, 16- or 32-bit data, and adapt accordingly on the fly. The execution unit will scale automatically, instruction-by-instruction, said McGuire.
"Mixing and matching the data types allows the architecture to be tuned to the precision of the task at hand," he said. Cellular basestation applications require high bandwidth to support new media types; remote access servers must support multiple channels on one chip to carry more subscribers at lower costs; and new air interface standards, vocoders and modems will demand programmability. Modulation and demodulation use 16-bit data types, as does voice coding. But filtering, echo cancellation, and line equalization can use either 16- or 32-bit data types. Forward error correction such as Viterbi detection can use either 8- or 16-bit data types, but new convolution codes use 32 bits. Thus, it's important for a processor to handle all of these.
While TigerSharc embodies a wide array of parallel resources, its programming model is closer to a short-pipeline RISC machine — an architecture Analog Devices calls "static superscalar." There are no ambiguities for the compiler, said McGuire. The machine has a very "deterministic execution flow," McGuire said. Everything is accomplished on the instruction level. There are 128 general purpose registers, and a programmer decides how these are used, he said. There is a two-cycle delay for every instruction, i.e., it takes two cycles for the results of a computational instruction to appear in the output register. This makes use of TigerSharc relatively easy for an assembly language programmer, McGuire said.
ADI offers plenty of high-level language support for programmers. With a 32-bit architecture, for example, it offers orthogonal addressing, no special hardware modes and user-determined branch prediction. Nevertheless, the TigerSharc's organization is intended to make things easy for the assembly language programmer. "Sometimes you gotta get your hands dirty," McGuire said about the need for assembly language programming. While futurists insist that DSP programming will depend more and more on C or C++, the highest DSP performance achieved to date has depended on program tweaking and tuning in assembly.
Samples of TigerSharc, produced in a 0.25-micron process, will be available in 1999, and CMOS scaling will allow the device to be clocked at higher speeds, according to ADI. The architecture will produce over 5 billion 16-bit MAC operations per second from a 0.1-micron process running at 600 MHz, McGuire said.
In an effort to keep the Microprocessor Forum announcements in perspective, Texas Instruments called editors in advance of the forum to say that its C6X DSP was the only high-performance architecture currently shipping, and the only one with significant design wins. "This is an architecture that's good for C," said Henry Wiechman, product marketing manager for TI (Dallas). "We've continued to improve the compiler." Efficiencies of 90 percent, and sometimes 100 percent, can be obtained for the C6201, Wiechman said. Moreover, tooling is being developed that will help reduce code size for memory-constrained applications.
Based on the sale of linkers, machinery that connects code-development platforms to target hardware, TI said over 1,000 designs are currently in progress using the C6X architecture, and new designs are being added at a rate of 10 per day. Code Composer Studio, introduced at DSP World last month, will help C6X programmers work more efficiently and get to market faster, Wiechman said.