K7 beats Katmai
Microprocessor Forum: Battle lines drawn for next-generation MPUs
By EE Times staff
EE Times (10/18/98, 10:23 p.m. EDT)
SAN JOSE, Calif. — The end of the century will be marked by architectural wars as intense as the battle between RISC and CISC, but far more complex.
That was the repeated message from a series of major microprocessor papers at the latest Microprocessor Forum. Intel Corp., IBM Corp., Advanced Micro Devices Inc. and Compaq Computer Corp. each unveiled details of their new flagship CPUs, revealing both common trends and profound differences in the pursuit of power.
Intel gave the first public description of some of the inner workings of Merced, the first IA-64 implementation. Compaq answered with a paper on EV7, to be known as the Alpha 21364, the processor many regard as the only head-on competitor Merced will face. But IBM disputed that view, unveiling an unsuspected development: a single-chip multiprocessor based on an advanced PowerPC core. And AMD detailed the K7, a chip aimed more at Intel's IA-32 line but one on which AMD will have to rely in the server market as well.
The four chips demonstrate some commonality in thinking, including a trend to move very large L2 caches onto the die with the CPU and a continuing search for more external bus bandwidth. And, of course, all the designs are counting on advanced processes to render their 300-MHz and above speeds possible and their enormous transistor counts affordable. But the differences among the designs were profound.
Microprocessor Report editor-in-chief Keith Diefendorff offered a taxonomy of the architectural search for power in a closing analysis. Warning that the traditional source of speed — faster clock rates — is about tapped out, Diefendorff said the industry will increasingly turn to parallelism, which can be found at a number of levels: process thread, control thread, instruction, data and bit. The major approaches to emerge thus far have been at the process, instruction and data levels.
Data-level parallelism is exploited in AMD's K7 and Intel's Katmai. Both Pentium-level machines offer aggressive superscalar design but focus on elaborate second-generation media processing units with extensive single-instruction, multiple-data (SIMD) capabilities.
While Katmai remains a relatively stock Pentium II design with an extended SIMD unit, the K7 is a ground-up re-architecting of the IA-32 instruction set. With three general-execution units and three address-calculation units, K7 will be “a wider superscalar machine than the Pentium II,” said AMD's director of K7 engineering, Dirk Meyer.
On the floating-point side, particularly for X87 code, the difference between the K7 and the Pentium II “is even more apparent,” he said. The K7 has two double-precision, fully pipelined X87 data paths, compared with one for the Pentium II — which, according to Meyer, is not fully pipelined for double precision. “If you want to do an X87 instruction on the Pentium II, in one sense you cycle the instruction through the pipe twice. So in some ways, for double-precision X87, the K7 has four times the peak execution rate.”
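Meyer's four-times figure follows from simple arithmetic, sketched below under stated assumptions: a fully pipelined data path is taken to complete one operation per clock, and a double-precision operation that cycles through the Pentium II pipe twice is taken to complete every other clock. The function and variable names are illustrative and do not come from AMD or Intel.

```python
def peak_ops_per_cycle(pipelines, passes_per_op):
    """Peak operations completed per clock across all pipelines.

    Assumption: each pipeline finishes one pass per clock, so an
    operation needing two passes completes every second clock.
    """
    return pipelines / passes_per_op

# K7: two fully pipelined double-precision X87 data paths.
k7_rate = peak_ops_per_cycle(pipelines=2, passes_per_op=1)

# Pentium II, per Meyer: one data path, two passes per double-precision op.
p2_rate = peak_ops_per_cycle(pipelines=1, passes_per_op=2)

speedup = k7_rate / p2_rate
print(f"K7 {k7_rate} ops/cycle, Pentium II {p2_rate} ops/cycle, ratio {speedup}x")
```

This is a peak rate only; sustained throughput depends on how well the code keeps both data paths busy.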
Because the 3DNow multimedia instructions are SIMD, in every register there are two single-precision numbers, and two pipelines can operate on that data. “So essentially, the K7 provides twice the performance [available on the Pentium II] for X87 code,” Meyer said.
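A minimal sketch of the SIMD idea described above: one operation acts on both single-precision values packed in a register. The helper names are hypothetical, and the code models the behavior in plain Python rather than reproducing any actual 3DNow instruction encoding.

```python
import struct

def to_f32(x):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def packed_add(reg_a, reg_b):
    """Add two registers lane by lane; each register holds two singles.

    On SIMD hardware both lane additions happen in one operation.
    """
    return (to_f32(reg_a[0] + reg_b[0]), to_f32(reg_a[1] + reg_b[1]))

LANES_PER_REGISTER = 2   # two single-precision numbers per register (from the article)
PIPELINES = 2            # two pipelines operating on that data (from the article)

# Peak single-precision results per clock if both pipelines stay busy.
peak_results_per_cycle = LANES_PER_REGISTER * PIPELINES

result = packed_add((1.5, 2.25), (0.5, 0.75))
print(result, peak_results_per_cycle)
```

With two lanes per register and two pipelines, the peak works out to four single-precision results per clock under these assumptions.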
That more aggressive superscalar design is more demanding on bus bandwidth and decode/dispatch logic. AMD's answer is a 200-MHz system bus. Licensed from Digital Equipment, the EV6 bus is the same as that used for the Alpha 21264 processor, now owned by Compaq Computer Corp.
The K7 will be placed on a daughtercard that is mechanically compatible with, though electrically different from, the Intel Slot 1 design, and Compaq could offer the Alpha on daughtercards to the commercial market, ensuring swappability with the K7 for system OEMs.
If the EV6 bus can supply K7's caches with enough data, the bandwidth problem will pass on to the processor's decode and dispatch logic. Here, AMD has taken two major steps. First, IA-32 instructions are decomposed not into rudimentary RISC operations but into what Meyer called macro-ops — slightly more complex steps that can include two rudimentary operations. The macro-ops are then heavily buffered throughout the decode, reservation and dispatch process to avoid stalls.
“One thing that differentiates the K7 from essentially any other X86 processor is the deep buffers we have in place . . . deep, deep instruction and memory schedulers and lots of memory and address data buffering,” Meyer said. “The more bandwidth you have, the more buffering you need to sustain the bandwidth. We went to fairly great lengths to make sure the machine would have the buffering it needed to uncover all the work that is available in auto-decode.”
Compatibility risk
The design gives AMD a claim to very high performance, especially at the target clock rates of over 500 MHz. But it also exposes the company to a profound risk: Both the media instructions and the system bus will be incompatible with Intel's product line.
While AMD advances along superscalar lines with K7, Intel will roll out the first of the IA-64 machines, taking its own new direction in the pursuit of instruction-level parallelism. As has been widely publicized, IA-64 depends on compiler technology to find and make explicit the hidden opportunities to execute instructions in parallel. The job of the hardware is to increase the opportunities for the compiler through such devices as predicated execution and to provide as many execution units as the compiler can use.
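Predicated execution can be illustrated with a toy model. In the sketch below, a conventional branch is replaced by two guarded instructions that are both issued, and only the one whose predicate is true commits its result. The tiny "machine" and all names are hypothetical; this is not IA-64 code or Intel's implementation, only a sketch of the general technique.

```python
def branchy_abs_diff(a, b):
    """Conventional version: the hardware must resolve or predict the branch."""
    if a > b:
        return a - b
    return b - a

def predicated_abs_diff(a, b):
    """Predicated version: no branch, both guarded instructions are issued."""
    regs = {"a": a, "b": b, "r": None}

    # A compare sets a predicate and its complement.
    preds = {"p_true": a > b}
    preds["p_false"] = not preds["p_true"]

    # Each entry is (guarding predicate, destination register, operation).
    program = [
        ("p_true", "r", lambda r: r["a"] - r["b"]),
        ("p_false", "r", lambda r: r["b"] - r["a"]),
    ]

    for pred, dest, op in program:
        value = op(regs)        # independent work that could run in parallel
        if preds[pred]:         # commit only when the guarding predicate is true
            regs[dest] = value
    return regs["r"]

print(predicated_abs_diff(7, 3), predicated_abs_diff(3, 7))
```

Because neither guarded instruction depends on the outcome of a branch, a compiler is free to schedule both alongside other work, which is the kind of opportunity the paragraph above describes.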
Intel gave some visibility into that process at the Microprocessor Forum, lifting the covers a bit on the internals of Merced. On the floating-point side, Merced will have an independent 128-register FP register file, two extended-precision multiply-accumulate units (MACs) and two single-precision MACs, according to a paper by Intel corporate vice president Stephen Smith. Those units, in combination with IA-64 compilers, will give Merced more than 20 times the performance of a Pentium Pro on 3-D graphics, Smith said.
The high-level view offered by Smith contained functional blocks for instruction fetch and decode, cache, bus control, translation lookaside buffer, floating-point unit, integer unit, IA-64 control and IA-32 control.
Discussion of the IA-32 block revealed more of Intel's approach to IA-32 compatibility. Merced will have a mode bit to put it in IA-64 or IA-32 mode. In the IA-32 mode, the instruction fetch will work normally, but instead of going directly to the execution units, IA-32 instructions will go to the IA-32 control hardware, where they will cause operations to be dispatched to the Merced execution units. Thus, Smith claimed, the small control block that takes in IA-32 instructions and emits operations to the execution units is the only hardware overhead for IA-32 binary compatibility.
Merced will have three levels of cache. An L0 cache will be closely tied to the execution unit. It will be backed by on-chip L1 cache. The multi-Mbyte L2 cache will be housed on a separate die.
Smith also said the processor will be housed in a new cartridge, which will contain both the CPU and Merced's caches.
Merced, which is due in mid-2000, will be followed in late 2001 by a previously announced processor code-named McKinley. Smith said that two additional IA-64 devices are on track to follow.
“We will move forward to 0.13-micron technology with a product code-named Madison” around 2002, he said. It will aim at high-end workstation and server applications. After that will be an IA-64 processor, code-named Deerfield, that Smith said will be “billed as a price/performance processor.”
No other vendor appears in a position to match the research and development investment that Hewlett-Packard Co. and Intel have poured into explicit instruction-level parallelism. But two vendors will challenge the performance of IA-64 by exploiting process-level parallelism. Both Compaq and IBM described fast, superscalar RISC machines that are aimed at tightly coupled multiprocessing.
On one 3.5-cm² die, the Alpha 21364 will integrate the Alpha 21264 core with a Direct Rambus memory controller, 1.5 Mbytes of L2 cache and a network interface to support direct links to four processors and I/O. Compaq senior consulting engineer Peter Bannon estimated the EV7 will deliver 70 SPECint95 and 120 SPECfp95 at speeds above 1 GHz and will appear in systems late in 2000, about six months behind Merced.
Bannon claimed the aggressively superscalar EV7 — with four integer and two floating-point execution units — will achieve instruction parallelism similar to Merced's on actual code. But Compaq is counting on another hardware feature to give it an advantage at the process level: an on-chip network interface for direct processor-to-processor connections in multiprocessing systems.
IBM leap
The port will offer 10-Gbyte/s bandwidth per processor, according to Bannon, with only 15-ns latency between processors. It will offer out-of-order messaging between processors and adaptive routing through multiprocessor configurations.
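To put Bannon's two figures in perspective, a first-order estimate of a processor-to-processor transfer adds the fixed latency to the time needed to push the bytes through the port. The additive model, the 64-byte message size and the reading of 1 Gbyte as 10^9 bytes are assumptions made here for illustration, not Compaq figures.

```python
LATENCY_NS = 15.0               # interprocessor latency quoted by Bannon
BANDWIDTH_BYTES_PER_NS = 10.0   # 10 Gbyte/s, assuming 1 Gbyte = 1e9 bytes

def transfer_time_ns(message_bytes):
    """First-order estimate: fixed latency plus serialization time."""
    return LATENCY_NS + message_bytes / BANDWIDTH_BYTES_PER_NS

# Hypothetical message sizes; 64 bytes stands in for a cache-line-sized transfer.
for size in (64, 1024, 65536):
    print(f"{size:>6} bytes: about {transfer_time_ns(size):.1f} ns")
```

Under this model, small messages are dominated by the 15-ns latency, while bandwidth matters mainly for larger transfers.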
IBM, meanwhile, is taking the interprocessor bandwidth race a major step further. Like the EV7, IBM's GigaProcessor will incorporate a full memory and I/O subsystem with about 2 Mbytes of cache, a memory controller and a high-speed network/processor interface on board. IBM will also put multiple instances of a new PowerPC core on the chip.
“We are substantially far along on the project,” said Charles Moore, a senior technical staff member at IBM. “We consider our primary competitor to be McKinley,” Intel's second-generation IA-64 chip.
By pulling several CPU cores onto a single die, IBM hopes to bypass much of the latency and bottlenecking associated with process-level multiprocessing, thereby accomplishing at the process level the same sorts of advantages that IA-64 seeks at the instruction level: increased parallelism without substantial losses to stalls or control overhead.
The approaches will all charge into the high-end server market between mid-1999 and mid-2000. OEM support may push the IA-64 to the fore, said Linley Gwennap, editorial director of the Microprocessor Report. “IA-64 will dominate the high-end market; Intel's OEM support can't be denied.”
Alpha could be the toughest competition for IA-64, but “it's imperative it outperform the IA-64,” Gwennap said. “Alpha has to maintain a performance lead, or Compaq will pull the plug.”
As for the RISCs, the 128 registers built into IA-64 chips like Merced — and the memory effects and load dependencies that RISCs take on — will give IA-64 a 2x or 3x advantage, said Michael Mahon of Hewlett-Packard. But the shift will occur over “a transition period that encompasses many years.”
By Ron Wilson, David Lammers and Alexander Wolfe. With additional reporting by Rick Boyd-Merritt