The Challenges are Changing for System Design
By Ron Wilson, Editor-in-Chief, Altera Corporation
As process technology pushes forward from 40 to 28 nm, unavoidable scaling effects are changing the electrical characteristics of the basic elements—the transistors and interconnect wires—with which chip designers must work. These new transistor-level challenges, in turn, are bringing about changes in the architecture, implementation, and performance of system-level ICs. And those chip-level changes are creating a new landscape in which system designers must find their way.
Since the early days of the IC, process scaling has promised unbroken progress. In each new generation the minimum feature size would decrease by about a third, and hence the area occupied by the smallest transistor would shrink by a factor of two. At the same time maximum clock frequency for a given digital circuit would increase, and power consumption would drop, both by relatively stable and significant factors. In each new generation, then, chip designers could offer system developers new ICs that integrated more functions, ran faster, and still burned less power. But those days are gone.
Today, in order to reduce the size of transistors, process developers must present chip designers with a complex battery of tradeoffs involving speed, power, and cost. Chip designers must employ all their tools, including novel circuit designs, new architectural approaches, and fundamental changes in algorithms, to continue to offer greater performance and acceptable power at a competitive price. So far these techniques are effective, but they are not transparent to system designers. The 28-nm generation is firmly within a new era in system design: an era in which system designers must understand the challenges and decisions of the chip designers who supply the silicon.
The End of Simple Scaling

The fundamental issue, process engineers say, is that as you shrink transistors and interconnect segments, their electrical characteristics change. As transistors get smaller, they no longer automatically get faster, and they begin to leak. So at 28 nm, transistors present designers not with assured progress on all fronts, but with a complex set of trade-offs. You can do a simple shrink of the 40-nm silicon oxynitride gate transistor and get a relatively low wafer cost. But you will not get either the highest available speed or the lowest power. You can improve the speed with strain engineering and reduce the leakage currents with a high-k/metal-gate (HKMG) stack, but at added cost. To a limited degree you can trade off power versus speed by changing operating voltage. You can trade off speed for leakage current by manipulating the transistor’s threshold voltage, either through design changes or by application of body bias. But there is no one sweet spot that delivers the fastest transistors at the lowest power for every application.
Figure 1. Sources of Transistor Leakage
Table 1. Main Sources of Transistor Leakage
Main Sources of Leakage | Impact | Mitigation Technique
Subthreshold leakage (Isub) | Dominant | Lower voltage; higher threshold voltage; longer gate length; dopant profile optimization
Gate direct-tunneling leakage (IG) | Dominant | High-k/metal gate (HKMG)
Gate-induced drain leakage (IGIDL) | Small | Dopant profile optimization
Reverse-biased junction leakage current (IREV) | Negligible | Dopant profile optimization
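To make the dominant subthreshold trade-off concrete, the sketch below uses the standard exponential dependence of subthreshold current on threshold voltage, Isub proportional to 10^(-Vth/S), where S is the subthreshold swing. The swing and current values here are illustrative assumptions, not foundry data; only the shape of the trade-off is the point.

```python
# Illustrative only: estimates how subthreshold leakage scales with
# threshold voltage using Isub ~ I0 * 10**(-Vth / S), where S is the
# subthreshold swing in volts per decade. All numbers are assumptions,
# not foundry data.

S = 0.090          # assumed subthreshold swing: 90 mV/decade
I0 = 100e-9        # assumed leakage per transistor at Vth = 0 (amps)

def subthreshold_leakage(vth):
    """Leakage current (A) for a given threshold voltage (V)."""
    return I0 * 10 ** (-vth / S)

for vth in (0.30, 0.35, 0.40, 0.45):
    ratio = subthreshold_leakage(vth) / subthreshold_leakage(0.30)
    print(f"Vth = {vth:.2f} V -> leakage x{ratio:.3f} relative to 0.30 V")
# Raising Vth by roughly one subthreshold swing (~90 mV) cuts leakage
# about 10x, but the slower transistor must still close timing.
```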
At the same time that transistors have become difficult, wires—seemingly the most basic element of circuits—have also become an issue. As the transistors get smaller, the wires in the first few layers of interconnect must get correspondingly narrower. But modern processes deposit the copper conductor for these interconnect lines in a narrow trench lined with a higher-resistivity barrier material, which is intended to keep the copper atoms out of the porous low-k insulation. As you shrink the process geometry, the trench gets narrower, but the barrier material doesn’t get much thinner, so there is even less space for the copper—hence, higher resistance. Worse, the dimensions of the copper filling are now close to the mean free path of electrons at working temperatures, so apparent resistivity of the copper is going up. Overall, interconnect resistance in the lowest layers is increasing, further eating away at the performance of the transistors and increasing power consumption.
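A back-of-the-envelope model shows why narrow copper lines become disproportionately resistive. The sketch below assumes a rectangular trench whose barrier liner does not thin as the trench shrinks; the dimensions and resistivity are illustrative assumptions, and the additional size-effect increase in resistivity mentioned above is ignored.

```python
# Rough model of per-micron wire resistance when a fixed-thickness barrier
# liner eats into a shrinking copper trench. Dimensions are illustrative
# assumptions, not process data.

RHO_CU = 1.7e-8     # bulk copper resistivity, ohm*m (size effects ignored)
BARRIER = 3e-9      # assumed liner thickness, m (does not scale)

def wire_resistance_per_um(width_nm, height_nm):
    """Resistance (ohms) of a 1 um length of lined copper trench."""
    w_cu = (width_nm - 2 * BARRIER * 1e9) * 1e-9   # liner on both sidewalls
    h_cu = (height_nm - BARRIER * 1e9) * 1e-9      # liner on the bottom
    area = w_cu * h_cu
    return RHO_CU * 1e-6 / area                     # R = rho * L / A, L = 1 um

for w, h in ((70, 130), (50, 90)):                  # coarser vs. finer metal pitch
    print(f"{w} nm wide trench: {wire_resistance_per_um(w, h):.1f} ohm/um")
# The narrower trench loses a larger fraction of its cross-section to the
# liner, so resistance rises faster than simple geometric scaling predicts.
```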
Thus, the march of progress is presenting chip designers with an interesting offer—increased density, but complex trade-offs between speed, power, and cost. The result is that in many cases, chip designers will see little gain in intrinsic speed from their new process technology.
New Directions in IC Design

The most unequivocal good news from 28-nm processes is the significantly higher transistor density, which translates directly into more gates, registers, and memory bits in a given area. In many applications, previous generations of system-level ICs had already integrated most of the functions that were appropriate for integration. So at 28 nm, chip designers are free to use all those extra transistors to get back the performance gains and power reductions that scaling didn’t deliver. Or they can opt for a significantly smaller die to reduce cost. Or they can choose some combination of these approaches, tailored to a particular set of applications.
In many cases, chip designers will choose to try to deliver the chip-level performance gains that system developers have come to expect. They can do this by spending transistors to buy performance, exploiting the inherent parallelism in the application.
Figure 2. Two Styles of Parallel Architectures
Exploiting Parallelism

There are several ways to exploit parallelism. One of the least application-dependent is to simply lengthen the pipelines of the execution units in the design, exploiting instruction-level opportunities to have several instructions in flight at once. This approach is particularly effective for accelerators that work on the inner loops of numerical algorithms, in digital signal processing for instance. But in general computation, studies have suggested that consistently trying to keep more than three instructions in process at one time invites diminishing returns.
A closely related approach, often used today in packet-processing applications, is to create a pipeline of function-level blocks. Here, data flows through a chain of processors, each of which performs a specific function on the data. Such pipelines tend to be very application-specific, and if the sequence of functions is either state- or data-dependent, the design can become unmanageably complex as designers attempt to make the operation of the pipeline change during execution.
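Structurally, such a function-level pipeline is a chain of stages connected by queues: each stage repeatedly takes an item from its input queue, applies its fixed function, and passes the result downstream. The sketch below models only that dataflow; the stage functions are hypothetical placeholders, and a real packet pipeline would be fixed hardware blocks rather than threads.

```python
# Structural sketch of a function-level pipeline: each stage owns an input
# queue, applies one fixed function, and forwards results downstream.
# Stage functions are placeholders; this models the dataflow, not hardware.
import queue
import threading

SENTINEL = object()

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is SENTINEL:
            q_out.put(SENTINEL)   # propagate end-of-stream to the next stage
            return
        q_out.put(fn(item))

# Hypothetical packet-processing steps, in a fixed order.
funcs = [lambda p: p | {"parsed": True},
         lambda p: p | {"classified": "flow-7"},
         lambda p: p | {"rewritten": True}]

queues = [queue.Queue() for _ in range(len(funcs) + 1)]
threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
           for i, f in enumerate(funcs)]
for t in threads:
    t.start()

for pkt_id in range(3):            # feed a few packets through the chain
    queues[0].put({"id": pkt_id})
queues[0].put(SENTINEL)

while (out := queues[-1].get()) is not SENTINEL:
    print(out)
for t in threads:
    t.join()
```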
The most fashionable approach to parallelism today is multiprocessing, in which many processors pick up data and tasks from a shared memory and work in parallel. There are many variations on this theme. Single-instruction, multiple-data (SIMD) clusters, in which each processing unit loads different data but all units execute a common instruction stream, are often used in numerical accelerators. There are processor clusters in which each processor performs a fixed set of tasks, taking data from its input queue and depositing results in its output queue independently of the others. And there are fully dynamic multicore designs in which essentially identical processors pick up tasks and data from a job queue and execute them, working independently of each other except for inter-task communications.
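The fully dynamic multicore style maps naturally onto a worker-pool sketch like the one below: identical workers pull independent tasks from a shared queue and run them in parallel. The task body and the pool size are arbitrary assumptions; only the structure is the point.

```python
# Sketch of the dynamic multicore style: identical workers pull independent
# tasks from one shared queue and run them in parallel. The task body and
# pool size are arbitrary assumptions.
from concurrent.futures import ProcessPoolExecutor

def task(block):
    """Placeholder per-task work: here, a checksum over a data block."""
    return sum(block) & 0xFFFF

if __name__ == "__main__":
    # Independent jobs waiting in the "job queue".
    jobs = [list(range(i, i + 256)) for i in range(0, 4096, 256)]
    with ProcessPoolExecutor(max_workers=4) as pool:   # 4 "cores" assumed
        results = list(pool.map(task, jobs))
    print(results)
```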
All of these approaches can substantially increase the processing bandwidth of a system-level IC. Alternatively, chip designers can use parallelism to maintain performance but slow down the clocks on the individual processing units, allowing them to exploit lower operating voltages, low-leakage transistors, and similar power-saving techniques. But these ideas only work if the application provides sufficient parallelism and the processing units remain busy and free of interruptions. Those are significant challenges.
Perhaps the most significant is the problem of moving data onto and off the die fast enough to keep up with the added processing power. Chip designers might succeed in using increased parallelism to multiply the chip’s processing bandwidth. But neither I/O bandwidth nor memory bandwidth is scaling rapidly. It is increasingly possible to design a system-level IC that physically cannot move data fast enough to support its own processing speed.
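A quick feasibility check exposes the mismatch: multiply the aggregate processing rate by the bytes each result must move on and off the chip, and compare that against the I/O and memory bandwidth actually available. Every figure in the sketch below is a placeholder chosen only to illustrate the arithmetic.

```python
# Back-of-the-envelope check: can the chip's I/O and memory interfaces feed
# its processing units? All numbers are placeholders for illustration.

cores            = 16
ops_per_core     = 500e6        # results produced per second, per core
bytes_per_result = 16           # data read + written per result (assumed)

required_gbps   = cores * ops_per_core * bytes_per_result * 8 / 1e9
memory_gbps     = 12.8 * 8      # e.g. one 64-bit DDR3-1600 channel, peak
serial_io_gbps  = 8 * 8 * 0.8   # e.g. 8 lanes near 8 Gbps at 80% efficiency

print(f"required : {required_gbps:7.1f} Gbps")
print(f"available: {memory_gbps + serial_io_gbps:7.1f} Gbps")
# If "required" exceeds "available", added parallelism on the die cannot be
# fed, no matter how many processing units the chip contains.
```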
In some ways resorting to parallelism exacerbates this problem, because of the intricate timing of DRAM. If a DRAM must support a random scattering of memory requests from different tasks on different processing units or cache controllers, the frequency of page misses will increase dramatically, and both the aggregate bandwidth and the predictability of DRAM transfers will be harmed. Highly sophisticated DRAM controllers can ease this problem, but at the cost of increased uncertainty in latency for individual processes. Similarly, chip designers may turn to transceiver-based high-speed serial buses such as PCI Express® Gen3, or chip-to-chip connections such as Interlaken, to boost I/O bandwidth. But a cacophony of requests from many processing units can undermine the best burst- or stream-oriented bus protocols, reducing the effectiveness of fast I/Os.
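The effect of scattered requests can be approximated with a simple row-buffer model: an access to the currently open DRAM page costs one fast column access, while a page miss pays precharge plus activate on top of it. The timings and hit rates below are assumptions chosen only to show the trend, not characteristics of any particular device.

```python
# Toy model of DRAM efficiency: accesses that hit the open row are cheap,
# page misses pay precharge + activate as well. Timings and hit rates are
# assumed values for illustration.

T_HIT_NS  = 5.0                       # column access to an already-open row
T_MISS_NS = 5.0 + 13.75 + 13.75       # plus precharge and activate (assumed)
BYTES_PER_ACCESS = 64

def effective_bandwidth(hit_rate):
    """Approximate sustained bandwidth (GB/s) for a given page-hit rate."""
    avg_ns = hit_rate * T_HIT_NS + (1 - hit_rate) * T_MISS_NS
    return BYTES_PER_ACCESS / avg_ns   # bytes per ns == GB/s

print(f"streaming, 90% page hits : {effective_bandwidth(0.90):5.1f} GB/s")
print(f"many masters, 30% hits   : {effective_bandwidth(0.30):5.1f} GB/s")
# Interleaving requests from many processing units drives the hit rate down,
# so both aggregate bandwidth and predictability suffer.
```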
Reducing Power

Just as chip designers can use the extra transistors from the 28-nm process to regain performance scaling, they can, perhaps ironically, spend transistor budget to reduce power. Improving efficiency begins with selecting the best process for the application, and choosing the best combination of transistor threshold voltage and operating voltage. Some designs will choose a single operating point for the entire logic portion of the chip. Others will operate different blocks at different voltages, and a very few will dynamically adjust the operating voltage on critical blocks based on performance requirements, temperature, and the intrinsic speed of the particular die. Similarly, many designs have access to transistors with different, or variable, threshold voltages and will select the transistor just fast enough to close timing for a block or even an individual timing path. Some designs also can turn off the clock or the power to an idle block. Such dynamic techniques are potentially very effective, but they cost transistors, and so have a power overhead, and they impose latencies during, for instance, clock-frequency or voltage transitions.
Figure 3. Static and Dynamic Power Comparison for Same Architecture on Same Process at 0.85 V and 1.0 V
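The voltage trade-off behind a comparison like Figure 3 follows from the first-order relations: dynamic power scales with C·V²·f, while static power is the product of supply voltage and leakage current. The sketch below applies those relations at 1.0 V and 0.85 V; the capacitance, leakage, and clock values are arbitrary illustrative assumptions, not measurements of any device.

```python
# First-order comparison of the same logic at two operating points.
# P_dynamic ~ C * V^2 * f, P_static ~ V * I_leak. All coefficients are
# arbitrary illustrative values, not measurements.

C_EFF = 2.0e-9      # assumed effective switched capacitance, farads
I_LEAK = 0.5        # assumed total leakage current at nominal voltage, amps

def power(v, f_mhz, leak_scale=1.0):
    dynamic = C_EFF * v * v * f_mhz * 1e6
    static  = v * I_LEAK * leak_scale   # leak_scale: assumed leakage drop at lower V
    return dynamic, static

for v, f, leak in ((1.00, 400, 1.0), (0.85, 400, 0.7)):
    dyn, stat = power(v, f, leak)
    print(f"{v:.2f} V: dynamic {dyn:.2f} W, static {stat:.2f} W, "
          f"total {dyn + stat:.2f} W")
# Dropping from 1.0 V to 0.85 V cuts dynamic power by about 28% (0.85^2) at
# the same clock, provided the slower transistors still close timing.
```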
The System Designer’s Task

The major strategies pursued by chip designers—increased use of parallelism, reliance on high-bandwidth external memory, the search for greater I/O bandwidth, and aggressive power management—all have implications for the system developers who will use the ICs. These implications span algorithm design, software development, board design and, ultimately, enclosure and cooling design.
At the algorithm level, the growing reliance on parallel hardware brings changes. For example, compared to a faster single processor, parallel processing often significantly increases latency: the time between when an input appears at the chip and when its effect is reflected in the chip’s output. System designers have to understand the impact of these increased latencies on the behavior of their systems. Another potential issue is that parallel approaches, and multiprocessing in particular, may require changes to algorithms to expose the necessary data or task parallelism. Does the system design team have the necessary skill and documentation to alter basic algorithms, or can they rely on the IC vendor for help?
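The latency cost is easy to quantify in the simplest case: spreading work across N slower units can hold aggregate throughput constant while multiplying the time any single input spends in flight. The unit counts and times below are purely illustrative.

```python
# Illustrative latency/throughput comparison: one fast unit versus N slower
# units delivering the same aggregate throughput. Numbers are assumptions.

def report(name, units, per_item_us):
    throughput = units / per_item_us            # items completed per microsecond
    print(f"{name:20s} latency {per_item_us:4.1f} us, "
          f"throughput {throughput:3.1f} items/us")

report("1 fast processor",   units=1, per_item_us=1.0)
report("8 slower processors", units=8, per_item_us=8.0)
# Same aggregate throughput, but each individual input now waits 8x longer
# before its result appears at the chip's output.
```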
The growing need for I/O bandwidth brings its own challenges at the system level. Attempting to achieve the necessary throughput by ever-increasing numbers of ever-faster pins can drive board designs into intractable routing and signal-integrity problems. Accordingly, chip designers are increasingly turning to transceiver-based high-speed serial buses and links, including PCI Express Gen3 and Interlaken, for example. This trend is even being explored by DRAM vendors, with ideas such as the Serial Port Memory Technology group or Micron’s Hybrid Memory Cube program. In all these cases, high-speed serial links can significantly increase the bandwidth and reduce the transmission power for data moving on and off a die. But transceiver-based links have their own latency issues, and again the system designer must understand the impact of those latencies on system behavior.
Power management brings another set of challenges, as the burden of lower power has to a great extent been transferred from process engineers to chip designers and on to system designers. These challenges begin with the need to understand accurately the use profiles of the end-system. What tasks are necessary under what circumstances, and how much performance does each require? What will the duty cycles of the tasks and the individual functional blocks be in a given use mode? Answers to these questions will influence decisions as fundamental as process technology choices, and as dynamic as the power-management protocol. In the latter case, it is important for the system design team to understand just what the IC’s power-management hooks will demand of the system. Will the board need to supply multiple voltages? Will it have to change them, or switch them on and off, and if so how complex a controller will be necessary?
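Use-profile questions like these ultimately reduce to a weighted average. The sketch below folds assumed per-mode power figures and duty cycles into an average system power; the modes, fractions, and wattages are hypothetical stand-ins for real system measurements.

```python
# Average power from a hypothetical use profile: per-mode power figures and
# duty cycles are assumptions standing in for real system measurements.

use_profile = {
    # mode: (fraction of time, power in watts)
    "active processing":        (0.20, 3.0),
    "standby, clocks gated":    (0.70, 0.4),
    "deep sleep, power gated":  (0.10, 0.05),
}

average_w = sum(frac * watts for frac, watts in use_profile.values())
print(f"average power: {average_w:.2f} W")
# A power-gating scheme only pays off if the software can actually hold the
# chip in the low-power modes for the fractions of time assumed here.
```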
It is also vital to understand just how much the software will know about the use profiles, and how much of this knowledge it can apply. For example, a sophisticated power-management protocol at the chip level may actually waste, rather than reduce, power if the system software cannot accurately anticipate long periods of inactivity for major blocks and convey this information to the power-management system.
Conclusion

In the 28-nm era, a system can only meet its performance, power, and cost requirements through the positive interaction of semiconductor technology choices, chip design, and system design. On subsequent pages, we will explore how one vendor, Altera, has approached each of these three challenges and the closely related issue of managing design-team effort, to ensure the best possible outcome for system designs.