Parallel processing -- they hope to succeed where Chromatic failed (good luck!!). Source: eetimes.com
Generic parallel architecture takes aim at ASIC turf
By Will Wade EE Times (10/04/99, 12:09 p.m. EDT)
FREMONT, Calif. -- Claiming a new approach to on-chip parallel processing, startup Cradle Technologies Inc. will unveil a general-purpose architecture this week that it hopes will replace ASICs across a broad swath of applications. The company is looking to set a high-water mark in semiconductor price/performance with a scalable platform of modules and programmable I/O links that shifts the burden of defining a chip to software.
Not only does the design promise lots of horsepower, but the economies of scale that come with producing a single processor core, replicated many times, mean low costs. "We predict that the cost for a gigaflop of computing power will drop from $300 in an ASIC to just $5 in a Universal Microsystem design," said Satish Gupta, president, chief executive officer and co-founder of the Fremont company, which will detail its approach this week at the Microprocessor Forum technical conference in San Jose, Calif.
However, at least one analyst cites thorny problems underlying parallelism that may prevent users from eking out that promised performance. What's more, critics say, because Cradle seeks to establish a general-purpose architecture, its processor could fail to establish itself in any specific market.
"We expect this platform to replace ASICs at the mid- and high end of the market," said Gupta. "All the functionality within our Universal Microsystem [UMS] chip is defined in software."
Cradle plans to deliver to customers both the chip and a portfolio of intellectual-property algorithms that will let them tailor the device to their needs. The UMS chip also relies on banks of off-chip EPROM that can be programmed to tell the processors how to work. Depending on the application, Gupta said a system would use between 500 kbytes and 2 Mbytes of EPROM to store the processor control codes.
The building block behind that software is a four-core module, which comes with a dedicated SRAM memory cache and can deliver 3 Gflops. Each core contains a RISC engine and two digital signal processing blocks. A chip can be made up of multiple modules, with die size the only limiting factor.
"If you need more power, you can put in another processor module," said Gupta. "With UMS, we can offer as much as 15 Gflops in a single chip."
Cradle eventually plans to create several products differentiated by the number of processor modules. The first of them will be produced at the 0.25-micron level, and the company has signed with IBM Corp. to provide foundry services.
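As a rough illustration of the module arithmetic, the Python sketch below works out how many modules a target throughput would imply. It assumes throughput scales linearly with module count, which the article does not confirm. The function names are hypothetical; only the 3-Gflops, four-core and two-DSP-block figures come from the text above.

```python
import math

# Figures quoted in the article for one UMS module.
GFLOPS_PER_MODULE = 3
CORES_PER_MODULE = 4
RISC_ENGINES_PER_CORE = 1
DSP_BLOCKS_PER_CORE = 2


def modules_needed(target_gflops):
    """Modules required for a target, ASSUMING linear scaling per module."""
    return math.ceil(target_gflops / GFLOPS_PER_MODULE)


def chip_resources(target_gflops):
    """Hypothetical helper: tally the blocks implied by a target throughput."""
    modules = modules_needed(target_gflops)
    cores = modules * CORES_PER_MODULE
    return {
        "modules": modules,
        "cores": cores,
        "risc_engines": cores * RISC_ENGINES_PER_CORE,
        "dsp_blocks": cores * DSP_BLOCKS_PER_CORE,
    }


resources = chip_resources(15)
print(resources)
```

Under that linear assumption, the 15-Gflops figure Gupta cites would correspond to five modules, or 20 cores, on a single die.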
While Cradle's approach appears to promise significant processing clout, Peter Glaskowsky, senior analyst for MicroDesign Resources (Sunnyvale, Calif.), said users may not actually see that kind of performance. "In real applications, it is hard to break up tasks neatly in order to utilize all the processor cores," he said. "The chip may offer all that power, but it is not certain what kind of performance levels the end user will actually get."
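Glaskowsky's caution can be put in rough numbers. The Python sketch below is a simplified model, not anything Cradle or MicroDesign Resources has published: it assumes each core contributes an equal share of a module's 3-Gflops peak and that an idle core contributes nothing. The function name is hypothetical.

```python
# Figures quoted in the article for one UMS module.
PEAK_GFLOPS_PER_MODULE = 3.0
CORES_PER_MODULE = 4


def delivered_gflops(busy_cores, modules=1):
    """Throughput actually delivered when only some cores have work.

    ASSUMPTION: peak throughput is split evenly across cores and
    idle cores contribute nothing.
    """
    total_cores = modules * CORES_PER_MODULE
    if not 0 <= busy_cores <= total_cores:
        raise ValueError("busy_cores must be between 0 and the core count")
    peak = modules * PEAK_GFLOPS_PER_MODULE
    return peak * busy_cores / total_cores


half_busy = delivered_gflops(2)              # one module, two of four cores busy
big_chip = delivered_gflops(10, modules=5)   # five modules, half the cores busy
print(half_busy, big_chip)
```

In this model an application that keeps only half the cores busy sees half the advertised peak, however many modules the chip carries.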
Cradle is not the first company to follow the parallel path. Besides recent attempts such as the MAJC architecture from Sun Microsystems Inc. or products from Equator Technology Inc., Glaskowsky pointed to Chromatic Research Inc. as one of the first companies to attempt VLIW parallel processing — and one of the first to run into trouble with it. "Most people attempting this now think they have learned what not to do by looking at Chromatic," he said.
Chromatic's single-processor design used thousands of registers; its very long instruction word scheme attempted to fill each register with small instructions to execute. Just as current multiprocessor implementations can see limited performance if cores are underutilized, the Chromatic design also had difficulty keeping each register full.
That is exactly the problem Gupta hopes to sidestep by using a different type of parallel processing. Both Sun and Equator utilize a parallel compiler to break up tasks at the instruction level, but Gupta said that approach is too complex for most processors and can bog down performance. "The problem [of dividing applications into smaller tasks] is really challenging," he noted. "A parallel compiler can do a reasonable job, but it isn't very efficient, because it's too complicated."
Gupta said the UMS architecture relies on a technique known as natural parallelism, which breaks down tasks based on their function. For example, in graphics applications each polygon will be assigned to a single processor. After that polygon is placed, the core will take on the next polygon.
Similarly, Gupta said UMS is ideal for networking, because each data packet can be seen as a single task to be processed. "We have taken the problem several levels above the approach that uses parallel compilers," he said. "Data parallelism is present in many high-performance applications, and that allows us to use natural parallelism."
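The task-per-packet idea Gupta describes resembles an ordinary worker pool. The Python sketch below is an analogy on a desktop runtime, not Cradle's programming model: the checksum routine, function names and packet contents are invented for illustration, and only the four-worker count echoes the four-core module described above.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_CORES = 4  # one four-core module, per the article


def process_packet(packet):
    """Hypothetical per-packet work: a simple additive checksum."""
    return sum(packet) % 256


def process_all(packets, workers=NUM_CORES):
    """Treat each packet as an independent task.

    A worker that finishes one packet picks up the next, much as the
    article describes a core moving on to the next polygon.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input packets.
        return list(pool.map(process_packet, packets))


# Invented sample data: eight three-byte packets.
packets = [bytes([i, i + 1, i + 2]) for i in range(8)]
checksums = process_all(packets)
print(checksums)
```

No compiler has to find the parallelism here, because the packets are independent to begin with. The sketch shows only the decomposition; Python threads do not run CPU-bound work in parallel in the standard interpreter, so it says nothing about performance.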
Glaskowsky agreed that such applications as networking or video systems may be an efficient use of the Cradle design. But he cautioned that Cradle is not going to do the work of breaking applications down into bite-size chunks for each processor. That task is left to the customers, and it will add to the effort required to use a UMS chip.
Besides multiple processor cores, the UMS architecture features several programmable input/output blocks. That means users can configure the chip to use whatever I/O format they need, and the control code is stored in the same external EPROM as the processor control code. "Having programmable I/O is key to making this chip part of a system, and not just a microprocessor," said Gupta. He described an open-format I/O block that can be configured for a variety of I/O protocols, including 1394, PCI, SCSI or USB, as well as controllers for LCDs and video output.
Weak link?
Yet the I/O design may turn out to be one of the architecture's weak links. "The I/O is really constrained, because the pin drivers don't support some of the really interesting stuff," said Glaskowsky. Though the pin drivers are configured for a range of power levels, he said that is probably not precise enough to use with USB, Ethernet or analog functions without also employing external level converters. These extra chips are not necessarily expensive or complicated to install, but adding them is an extra step in the design process.
Gupta conceded that this is one of the architecture's weaker points, but he said future generations will feature more precise controllers to address the issue.
"Some people will probably like the design, if their application can be easily broken up to fully utilize the multiple cores," Glaskowsky said. "What [Cradle] needs to do is focus on a few areas to prove they can successfully undercut the prices of competing products and deliver higher performance levels."