Politics : Formerly About Advanced Micro Devices

To: Paul Engel who wrote (39714), 10/20/1998 5:41:00 PM
From: Maverick
 
K7 beats Katmai
Microprocessor Forum: Battle lines
drawn for next-generation MPUs

By EE Times staff
EE Times
(10/18/98, 10:23 p.m. EDT)

SAN JOSE, Calif. — The end of the century will be marked by
architectural wars as intense as the battle between RISC and
CISC, but far more complex.

That was the repeated message from a series of major
microprocessor papers at the latest Microprocessor Forum. Intel
Corp., IBM Corp., Advanced Micro Devices Inc. and Compaq
Computer Corp. each unveiled details of their new flagship CPUs,
revealing both common trends and profound differences in the
pursuit of power.

Intel gave the first public description of some of the inner workings
of Merced, the first IA-64 implementation. Compaq answered with
a paper on EV7, to be known as the Alpha 21364, the processor
many regard as the only head-on competitor Merced will face. But
IBM disputed that view, unveiling an unsuspected development: a
single-chip multiprocessor based on an advanced PowerPC core.
And AMD detailed the K7, a chip aimed more at Intel's IA-32 line
but one on which AMD will have to rely in the server market as
well.

The four chips demonstrate some commonality in thinking, including
a trend to move very large L2 caches onto the die with the CPU
and a continuing search for more external bus bandwidth. And, of
course, all the designs are counting on advanced processes to
render their 300-MHz and above speeds possible and their
enormous transistor counts affordable. But the differences among
the designs were profound.

Microprocessor Report editor-in-chief Keith Diefendorff offered a
taxonomy of the architectural search for power in a closing analysis.
Warning that the traditional source of speed — faster clock rates
— is about tapped out, Diefendorff said the industry will
increasingly turn to parallelism, which can be found at a number of
levels: process thread, control thread, instruction, data and bit. The
major approaches to emerge thus far have been at the process,
instruction and data levels.

Data-level parallelism is exploited in AMD's K7 and Intel's Katmai.
Both Pentium-level machines offer aggressive superscalar design
but focus on elaborate second-generation media processing units
with extensive single-instruction, multiple-data (SIMD) capabilities.

While Katmai remains a relatively stock Pentium II design with an
extended SIMD unit, the K7 is a ground-up re-architecting of the
IA-32 instruction set. With three general-execution units and three
address-calculation units, K7 will be “a wider superscalar machine
than the Pentium II,” said AMD's director of K7 engineering, Dirk
Meyer.

On the floating-point side, particularly for X87 code, the difference
between the K7 and the Pentium II “is even more apparent,” he
said. The K7 has two double-precision, fully pipelined X87 data
paths, compared with one for the Pentium II — which, according
to Meyer, is not fully pipelined for double precision. “If you want to
do an X87 instruction on the Pentium II, in one sense you cycle the
instruction through the pipe twice. So in some ways, for
double-precision X87, the K7 has four times the peak execution
rate.”
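
Meyer's four-times figure follows from simple per-clock arithmetic. Below is a minimal sketch in Python, assuming (as the quote suggests) that a double-precision X87 operation occupies the Pentium II's single pipe for two cycles, while each of the K7's two pipes accepts a new operation every cycle. The function name and parameters are illustrative only and do not come from AMD or Intel documentation.

```python
from fractions import Fraction

def peak_ops_per_cycle(pipes, cycles_per_op):
    """Peak operations started per clock: each pipe starts one op every cycles_per_op cycles."""
    return Fraction(pipes, cycles_per_op)

# Assumption drawn from Meyer's remarks: two fully pipelined K7 pipes,
# one Pentium II pipe that a double-precision op passes through twice.
k7 = peak_ops_per_cycle(pipes=2, cycles_per_op=1)
pentium_ii = peak_ops_per_cycle(pipes=1, cycles_per_op=2)

ratio = k7 / pentium_ii
print(f"K7: {k7} ops/cycle, Pentium II: {pentium_ii} ops/cycle, ratio: {ratio}x")
```

This is a peak rate at equal clock speeds; sustained performance on real code would depend on many other factors.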

Because the 3DNow multimedia instructions are SIMD, in every
register there are two single-precision numbers, and two pipelines
can operate on that data. “So essentially, the K7 provides twice the
performance [available on the Pentium II] for X87 code,” Meyer
said.
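
The packing Meyer describes can be sketched as follows. This is an illustration in Python, assuming a 64-bit register that holds two 32-bit single-precision values and two pipelines that each process one such register per cycle. The helper names are hypothetical and do not correspond to actual 3DNow instructions.

```python
import struct

LANES_PER_REGISTER = 2   # two single-precision values per 64-bit register
PIPELINES = 2            # assumption: each pipeline handles one register per cycle

def pack_pair(a, b):
    """Pack two single-precision floats into one 8-byte (64-bit) value."""
    return struct.pack("<2f", a, b)

def simd_add(reg_x, reg_y):
    """Add two packed registers lane by lane, as a single SIMD instruction would."""
    x = struct.unpack("<2f", reg_x)
    y = struct.unpack("<2f", reg_y)
    return struct.pack("<2f", *(p + q for p, q in zip(x, y)))

result = simd_add(pack_pair(1.0, 2.0), pack_pair(0.5, 0.25))
lanes = struct.unpack("<2f", result)

# Two lanes per register times two pipelines gives the peak results per cycle.
peak_results_per_cycle = LANES_PER_REGISTER * PIPELINES
print(lanes, peak_results_per_cycle)
```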

That more aggressive superscalar design is more demanding on bus
bandwidth and decode/dispatch logic. AMD's answer is a
200-MHz system bus. Licensed from Digital Equipment, the EV6
bus is the same as that used for the Alpha 21264 processor, now
owned by Compaq Computer Corp.

The K7 will be placed on a daughtercard that is mechanically
compatible with, though electrically different from, the Intel Slot 1
design, and Compaq could offer the Alpha on daughtercards to the
commercial market, ensuring swappability with the K7 for system
OEMs.

If the EV6 bus can supply K7's caches with enough data, the
bandwidth problem will pass on to the processor's decode and
dispatch logic. Here, AMD has taken two major steps. First,
IA-32 instructions are decomposed not into rudimentary RISC
operations but into what Meyer called macro-ops — slightly more
complex steps that can include two rudimentary operations. The
macro-ops are then heavily buffered throughout the decode,
reservation and dispatch process to avoid stalls.
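
A toy illustration of the difference, in Python. The instruction syntax, the operation names and the `MacroOp` class are all hypothetical; the sketch only shows the idea of carrying up to two rudimentary operations as one tracked unit, not the K7's actual decoder.

```python
from dataclasses import dataclass
from typing import List, Tuple

def to_risc_ops(instr: str) -> List[str]:
    """Split a toy 'op dst, src' instruction into rudimentary operations."""
    op, operands = instr.split(maxsplit=1)
    dst, src = [s.strip() for s in operands.split(",")]
    if src.startswith("["):          # memory source: needs a separate load
        return [f"load tmp, {src}", f"{op} {dst}, tmp"]
    return [f"{op} {dst}, {src}"]

@dataclass
class MacroOp:
    """One scheduling unit holding at most two rudimentary operations."""
    ops: Tuple[str, ...]

def to_macro_op(instr: str) -> MacroOp:
    """Wrap the rudimentary operations of one instruction in a single unit."""
    ops = tuple(to_risc_ops(instr))
    assert len(ops) <= 2, "toy model: a macro-op holds at most two operations"
    return MacroOp(ops)

risc = to_risc_ops("add eax, [ebx]")   # two entries to buffer and track
macro = to_macro_op("add eax, [ebx]")  # one entry carrying both steps
print(len(risc), "RISC-style ops vs 1 macro-op:", macro.ops)
```

In this toy model, the buffers downstream of the decoder track one entry per instruction instead of two, which is one plausible reading of why the coarser unit helps.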

“One thing that differentiates the K7 from essentially any other X86
processor is the deep buffers we have in place . . . deep, deep
instruction and memory schedulers and lots of memory and address
data buffering,” Meyer said. “The more bandwidth you have, the
more buffering you need to sustain the bandwidth. We went to
fairly great lengths to make sure the machine would have the
buffering it needed to uncover all the work that is available in
auto-decode.”
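
Meyer's point that more bandwidth calls for more buffering is consistent with Little's law: the number of operations that must be in flight equals throughput multiplied by latency. The short Python sketch below uses made-up figures purely to show the trend; they are not K7 specifications.

```python
import math

def entries_needed(ops_per_cycle: float, latency_cycles: float) -> int:
    """Little's law: in-flight entries = throughput x latency, rounded up."""
    return math.ceil(ops_per_cycle * latency_cycles)

# Hypothetical numbers: same latency, one-wide versus three-wide issue.
narrow = entries_needed(ops_per_cycle=1, latency_cycles=20)
wide = entries_needed(ops_per_cycle=3, latency_cycles=20)
print(narrow, wide)
```

At a fixed latency, tripling the issue width triples the buffer entries needed to keep the machine busy.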

Compatibility risk
The design gives AMD claims to very high performance, especially
at the claimed target clock rates of over 500 MHz. But it also
exposes the company to a profound risk: Both the media
instructions and the system bus will be incompatible with Intel's
product line.

While AMD advances along superscalar lines with K7, Intel will
roll out the first of the IA-64 machines, taking its own new direction
in the pursuit of instruction-level parallelism. As has been widely
publicized, IA-64 depends on compiler technology to find and
make explicit the hidden opportunities to execute instructions in
parallel. The job of the hardware is to increase the opportunities for
the compiler through such devices as predicated execution and to
provide as many execution units as the compiler can use.
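
Predicated execution can be illustrated with a small sketch. The Python below assumes a toy model in which both arms of a branch are computed and a predicate selects which result is kept. It shows the general if-conversion idea only, not IA-64's actual instruction encoding, and the function names are hypothetical.

```python
def branching_abs_diff(a: int, b: int) -> int:
    """Conventional form: the hardware must predict which arm runs."""
    if a > b:
        return a - b
    return b - a

def predicated_abs_diff(a: int, b: int) -> int:
    """If-converted form: both arms execute, a predicate picks the result."""
    p = a > b          # the compare sets the predicate
    r_true = a - b     # would be guarded by p
    r_false = b - a    # would be guarded by not p
    # Stands in for committing only the result whose predicate is true.
    return r_true if p else r_false

for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(a, b, branching_abs_diff(a, b), predicated_abs_diff(a, b))
```

The two arms in the predicated form have no control dependence on each other, so a compiler is free to schedule them side by side on separate execution units.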

Intel gave some visibility into that process at the Microprocessor
Forum, lifting the covers a bit on the internals of Merced. On the
floating-point side, Merced will have an independent 128-register
FP register file, two extended-precision multiply-accumulate units
(MACs) and two single-precision MACs, according to a paper by
Intel corporate vice president Stephen Smith. Those units, in
combination with IA-64 compilers, will give Merced more than 20
times the performance of a Pentium Pro on 3-D graphics, Smith
said.

The high-level view offered by Smith contained functional blocks
for instruction fetch and decode, cache, bus control, translation
lookaside buffer, floating-point unit, integer unit, IA-64 control and
IA-32 control.

Discussion of the IA-32 block revealed more of Intel's approach to
IA-32 compatibility. Merced will have a mode bit to put it in
IA-64 or IA-32 mode. In the IA-32 mode, the instruction fetch will
work normally, but instead of going directly to the execution units,
IA-32 instructions will go to the IA-32 control hardware, where they
will cause operations to be dispatched to the Merced execution
units. Thus, Smith claimed, the small control block that takes in
IA-32 instructions and emits operations to the execution units is
the only hardware overhead for IA-32 binary compatibility.
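
The mode-bit arrangement Smith described can be modeled loosely as follows. This Python sketch is a toy: the instruction strings and the translation table are invented for illustration, and nothing here reflects Merced's real internal operations.

```python
from enum import Enum
from typing import List

class Mode(Enum):
    IA64 = 64
    IA32 = 32

# Hypothetical mapping from toy IA-32 instructions to toy native operations.
IA32_TO_NATIVE = {
    "push eax": ["sub sp, 4", "store [sp], eax"],
    "inc eax": ["add eax, 1"],
}

def dispatch(mode: Mode, instr: str) -> List[str]:
    """Return the operations sent to the shared execution units."""
    if mode is Mode.IA64:
        return [instr]               # native instructions go straight through
    return IA32_TO_NATIVE[instr]     # the IA-32 control block emits native operations

print(dispatch(Mode.IA64, "add r1 = r2, r3"))
print(dispatch(Mode.IA32, "push eax"))
```

In this model the execution units are shared by both modes, and only the small translation step is specific to IA-32, which mirrors Smith's claim about hardware overhead.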

Merced will have three levels of cache. An L0 cache will be closely
tied to the execution unit. It will be backed by on-chip L1 cache.
The multi-Mbyte L2 cache will be housed on a separate die.

Smith also said the processor will be housed in a new cartridge,
which will contain both the CPU and Merced's caches.

Merced, which is due in mid-2000, will be followed in late 2001 by
a previously announced processor code-named McKinley. Smith
said that two additional IA-64 devices are on track to follow.

“We will move forward to 0.13-micron technology with a product
code-named Madison” around 2002, he said. It will aim at
high-end workstation and server applications. After that will be an
IA-64 processor, code-named Deerfield, that Smith said will be
“billed as a price/performance processor.”

No other vendor appears in a position to match the research and
development investment that Hewlett-Packard Co. and Intel have
poured into explicit instruction-level parallelism. But two vendors
will challenge the performance of IA-64 by exploiting process-level
parallelism. Both Compaq and IBM described fast, superscalar
RISC machines that are aimed at tightly coupled multiprocessing.

On one 3.5-cm² die, the Alpha 21364 will integrate the Alpha 21264
core with a Direct Rambus memory controller, 1.5 Mbytes of L2
cache and a network interface to support direct links to four
processors and I/O. Compaq senior consulting engineer Peter
Bannon estimated the EV7 will deliver 70 SPECint95 and 120
SPECfp95 at speeds above 1 GHz and will appear in systems late
in 2000, about six months behind Merced.

Bannon claimed the aggressively superscalar EV7 — with four
integer and two floating-point execution units — will achieve
instruction parallelism similar to Merced's on actual code. But
Compaq is counting on another hardware feature to give it an
advantage at the process level: an on-chip network interface for
direct processor-to-processor connections in multiprocessing
systems.

IBM leap
The port will offer 10-Gbyte/s bandwidth per processor, according
to Bannon, with only 15-ns latency between processors. It will
offer out-of-order messaging between processors and adaptive
routing through multiprocessor configurations.
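
Taken together, the two figures Bannon quoted imply how much data must be in transit to keep one link busy. A back-of-the-envelope sketch in Python, assuming decimal gigabytes and treating the 15 ns as one-way latency; neither assumption is stated in the article.

```python
# 10 Gbytes/s equals 10 bytes per nanosecond (decimal gigabytes assumed).
BYTES_PER_NS = 10
# 15 ns between processors, as quoted; assumed here to be one-way.
LATENCY_NS = 15

# Bandwidth-delay product: bytes that must be in flight to saturate the link.
in_flight_bytes = BYTES_PER_NS * LATENCY_NS
print(in_flight_bytes, "bytes in flight per link")
```

Under these assumptions, a sender would need to keep roughly 150 bytes outstanding on the link before the first byte arrives at the neighboring processor.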

IBM, meanwhile, is taking the interprocessor bandwidth race a
major step further. Like the EV7, IBM's GigaProcessor will
incorporate a full memory and I/O subsystem with about 2 Mbytes
of cache, a memory controller and a high-speed network/processor
interface on board. IBM will also put multiple iterations of a new
PowerPC core on the chip.

“We are substantially far along on the project,” said Charles
Moore, a senior technical staff member at IBM. “We consider our
primary competitor to be McKinley,” Intel's second-generation
IA-64 chip.

By pulling several CPU cores onto a single die, IBM hopes to
bypass much of the latency and bottlenecking associated with
process-level multiprocessing, thereby accomplishing at the process
level the same sorts of advantages that IA-64 seeks at the
instruction level: increased parallelism without substantial losses to
stalls or control overhead.

The approaches will all charge into the high-end server market
between mid-1999 and mid-2000. OEM support may push the
IA-64 to the fore, said Linley Gwennap, editorial director of the
Microprocessor Report. “IA-64 will dominate the high-end
market; Intel's OEM support can't be denied.”

Alpha could be the toughest competition for IA-64, but “it's
imperative it outperform the IA-64,” Gwennap said. “Alpha has to
maintain a performance lead, or Compaq will pull the plug.”

As for the RISCs, the 128 registers built into IA-64 chips like
Merced — and the memory effects and load dependencies that
RISCs take on — will give IA-64 a 2x or 3x advantage, said
Michael Mahon of Hewlett-Packard. But the shift will occur over
“a transition period that encompasses many years.”

— By Ron Wilson, David Lammers and Alexander Wolfe. With
additional reporting by Rick Boyd-Merritt