Politics : Formerly About Advanced Micro Devices


To: Epinephrine who wrote (93852)2/17/2000 12:51:00 PM
From: kash johal
 
Thread,

More Willy stuff from Ace's Hardware:

Seems like a pretty evenhanded review. Overall Willy has some pretty neat items. It just may not be any better than the Athlon on IPC.

So it will come down to a MHz race. I don't recall who said it: "the more things change, the more they stay the same".

Willamette Under The Microscope

By Johan De Gelas
Thursday, February 17, 2000

"Intel has stunned the industry with their demonstration of a 1.5 GHz Willamette CPU. The speed that this handpicked chip attains is astonishing and has fueled quite a bit of speculation about its future. We thought it would be interesting to unravel some of the mysteries surrounding this long-secretive chip; we're up for the challenge.

Clockspeed

Intel is betting everything on clockspeed. A 20-stage pipeline is the longest pipeline ever implemented in a mainstream processor. Such a long pipeline means that some instructions take 20 cycles before they write back their results. The work that must be done to finish one instruction is split up into 20 clockcycles, which means very little is accomplished within any single clockcycle, and therefore the clockcycle can be very short. The Athlon, by comparison, has a 10-stage pipeline, while the Pentium III has a 12-stage pipeline (read all about clockspeed and pipelines here).

In the same article we also make note of some significant drawbacks which plague such an approach. These drawbacks include longer latencies and higher branch misprediction penalties.

Longer Latencies

In modern superscalar CPUs, instructions wait to be issued in a buffer. The number of clockcycles required to return a result, or the time an instruction spends in the execution units (FPU or ALU) is what we refer to as latency in this article. As mentioned here, a longer pipeline means longer instruction latencies. If there is an instruction in the buffer that needs the result of a previous instruction, it will have to wait longer before it can be issued.

For example, FADD requires four clockcycles on the Athlon. Let us assume that it takes six clockcycles on the Pentium IV (Willamette). If we are executing this sort of code:

Instruction 1: A = value B + value C
Instruction 2: D = A + 1

Instruction 2 can only be issued 4 clocks after the first instruction on the Athlon, and 6 clocks on the Pentium IV. Therefore, the instruction control unit (which issues the instructions out of order) has to find at least four instructions in its buffer that don't depend on the result of an executing instruction to maintain an IPC (Instructions Per Clockcycle) of 1. Of course, an IPC of 1 is very low, and a processor like the Athlon is built to execute an average of 2 x86 instructions per clockcycle.

This means that 3 micro-ops (one x86 instruction = +/- 1.5 micro-ops) must be executed each clockcycle, or the Instruction Control Unit must find at least 12 (4 clockcycles x 3 micro-ops) instructions that it can issue. Now, in the case that the Pentium IV has 50% higher latencies, you may add 50% to each of those numbers.

This example is, of course, nothing more than an example, and my numbers are most likely not correct. The point is that higher latencies make it harder to maintain high IPCs. The instruction streams of most programs today contain very little parallelism, or instructions that do not depend on each other. The longer the latencies, the harder it gets to issue instructions each clockcycle. A larger buffer helps to some extent. Intel doesn't tell us how big this buffer is, just that it is "significantly deeper." No wonder: the buffer on the PIII was pretty small, only 40 micro-ops, the equivalent of +/- 20 x86 instructions, while the Athlon has a buffer with room for 78 micro-ops.
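The arithmetic above can be put into a tiny Python sketch. The function name and the general rule are mine, following the article's illustrative figures:

```python
# Rough rule of thumb from the example above: to sustain a target issue rate
# despite an instruction latency of L cycles, the out-of-order scheduler
# must find roughly L * IPC independent micro-ops in its buffer.

def independent_ops_needed(latency_cycles: int, target_ipc: int) -> int:
    """Independent micro-ops needed in flight to hide the given latency."""
    return latency_cycles * target_ipc

# The article's numbers: FADD latency of 4 on the Athlon, target of 3 micro-ops/clock
print(independent_ops_needed(4, 3))  # 12, as the article computes
# A hypothetical Willamette with 50% longer latency (6 cycles)
print(independent_ops_needed(6, 3))  # 18
```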

Higher Branch Prediction Penalties

This one is pretty obvious. If the CPU takes the wrong branch, a deeply pipelined CPU will be doing useless work for more clockcycles than a CPU with a short pipeline. The branch misprediction penalty will be much higher (read more about branch prediction here). A better branch prediction unit can help, but such a unit might severely hurt clockspeed rampability.

15 to 20 percent of x86 instructions found in a typical program are branches. It is very clear that this will have a negative impact on the performance of Willamette. To minimize such damage, Intel has improved the branch prediction used in Willamette by "combining all currently available prediction schemes."
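As a back-of-envelope illustration of why this matters, the average cost of branches scales with the flush penalty, which grows with pipeline depth. The misprediction rate below is an assumption of mine, not a figure from the article:

```python
# Average pipeline-flush cycles lost per instruction:
# fraction of branches * misprediction rate * flush penalty in cycles.
def mispredict_overhead(branch_frac: float, mispredict_rate: float,
                        penalty_cycles: int) -> float:
    return branch_frac * mispredict_rate * penalty_cycles

# Assumed numbers: 17.5% branches (midpoint of the article's 15-20%),
# a hypothetical 5% misprediction rate, penalty roughly the pipeline depth.
print(mispredict_overhead(0.175, 0.05, 20))  # deep 20-stage pipeline
print(mispredict_overhead(0.175, 0.05, 10))  # a 10-stage pipeline loses half as much
```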

FPU

The stack-based x87 FPU with its eight architectural registers has a very limited future. It is the worst FPU of all CPUs currently on the market. AMD's Athlon tried to solve the x87 FPU problem with register renaming, essentially a huge secret register file (88 entries), and an out-of-order triple-issue FPU. The Athlon FPU performed up to 40% better than the PIII's, clock for clock. This was a huge improvement, but Athlon's FPU power is still nowhere near Alpha's.

That is why AMD made the decision to add a RISC-like large flat floating point register file and three-address floating-point instructions for the 64-bit x86-64 instruction set to be implemented in the K8 (read Paul DeMone's excellent article).

For the Pentium IV or Willamette, Intel has decided to promote ISSE2 as an alternative to x87. The x87 FPU performance of the Pentium IV will not be very high, clock for clock.

The evidence:

- The FXCH instruction, used to shuffle data around in the stack-based model, is no longer an instruction that takes only a bit of decode bandwidth and no execution resources, as it was on the PIII. Intel dissuades developers from using FXCH.
- The latencies of FADD and FMUL are longer, and some FPU instructions have a latency of 10 cycles!
- The FPU has two functional units, fewer than the Athlon: one for FADD and FMUL, one for FSTORE and FLD. In other words, the Athlon can theoretically do one floating-point addition and one multiplication per clockcycle, while the Pentium IV can only do one multiplication or one addition.

Intel is concentrating on ISSE2: no less than 144 new instructions. If Intel can rally enough support behind ISSE2, the Willamette FPU performance will blow everybody else out of the water, as ISSE2 FPU performance is vastly superior to the x87 in single precision. A dual ISSE2 unit at 1.4 GHz, the clockspeed at which the fastest Willamette will ship (Q4 2000), would boast a peak of no less than 2 x 4 (SIMD) x 1.4 GHz = 11.2 billion floating-point operations per second (11.2 GFLOPS)! The x87 FPU would deliver a measly 1.4 GFLOPS peak.

For comparison, if AMD manages to introduce a 1.2 GHz Athlon by the time the Willamette ships, the x87 FPU will peak at 2.4 GFLOPS, while the 3DNow! units will peak at 4.8 GFLOPS.

All of these numbers are theoretical, of course, but it gives you an idea of how powerful such a SIMD implementation can be.
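The peak figures quoted here are just functional units x SIMD width x clock. A quick Python check of that arithmetic (the function name is mine):

```python
def peak_mflops(fp_units: int, simd_width: int, clock_mhz: int) -> int:
    """Theoretical peak MFLOPS = functional units * SIMD lanes * clock (MHz)."""
    return fp_units * simd_width * clock_mhz

print(peak_mflops(2, 4, 1400))  # 11200 MFLOPS = 11.2 GFLOPS: dual ISSE2 at 1.4 GHz
print(peak_mflops(1, 1, 1400))  # 1400 MFLOPS: scalar x87 on the same chip
print(peak_mflops(2, 1, 1200))  # 2400 MFLOPS: Athlon x87, one add + one mul per clock
print(peak_mflops(2, 2, 1200))  # 4800 MFLOPS: Athlon 3DNow!, two 2-wide units
```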

Clock Speeds

There have been claims that the ALUs on the Willamette are running at 3 GHz. Intel confirmed that the ALU (Arithmetic Logic Unit) of the 1.5 GHz Willamette is double-pumped (see the developer's manual), but just as a 66 MHz AGP 2x bus doesn't run at 133 MHz, a double-pumped 1.5 GHz ALU does not have to run at 3 GHz (if it did, wouldn't Intel have said so?).

"Double-pumped" most likely refers to the ability to use both the rising and the falling edges of the clockcycle to trigger gates. You can understand this special ALU by imagining that it is a sort of a double unit, with one that works with the falling edge, and one that works with the rising edge of the clockpulse. Although it performs like a 3 GHz ALU, it probably isn't a real 3 GHz ALU. This is speculation on my part, but there is more.

(Read here how triggering gates is the way the ALU processes data. Look for the section: "Your first microprocessor.")

Note this comment from Intel:

"Double-pumping allows very low latency integer operations to be completed at a rate higher than the processor clock frequency" and "Integer ALU clocked at twice the frequency: reduced latency increases the performance for certain integer operations." That seems to indicate that double pumping is essentially used to lower the latencies of the instructions, to compensate for the long pipeline.

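If that reading is right, an ALU triggered on both clock edges completes two simple ops per core cycle, so its effective throughput is twice the core clock. A one-line sketch of that assumption:

```python
# Sketch of the speculation above: an ALU triggered on both the rising and
# falling clock edges retires two simple integer ops per core cycle,
# so it performs like an ALU at twice the core frequency.
def effective_alu_ops_per_sec(core_clock_hz: int, edges_per_cycle: int = 2) -> int:
    return core_clock_hz * edges_per_cycle

print(effective_alu_ops_per_sec(1_500_000_000))  # 3000000000: behaves like 3 GHz
```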
Trace Cache

The reason why the trace cache is there seems to be clock speed and only clock speed. Normally, a trace cache assists the fetcher. This is only necessary when you need to feed a very wide superscalar CPU like Intel's Itanium. However, in the case of Willamette, it seems that clock speed is the real reason. More info in the article we posted earlier:

The problem with taken branches is that they make the fetcher jump around in the I-cache. The fetcher can only really be effective if instructions are located contiguously within the L1 cache. As you know, fetching is the first thing a CPU does, and x86 CPUs fetch in one cycle. If the fetcher has to fetch instructions from noncontiguous locations, it cannot do it in a single clockcycle, as that would mean lower clock speeds (more work in one clockcycle = lower clockspeeds).

The fetcher then requires more than one clockcycle, and the number of instructions fetched per clockcycle is correspondingly lower.

Enter the trace cache. The trace cache is a special addition to the instruction cache that tries to find out how the branches of a program behave. A trace cache will try to store the instructions in the sequence they are executed. So, if a branch makes a jump from instruction X at location 200 (in the I-cache) to instruction Y at location 300 (in the I-cache), the trace cache will store the second instruction Y in a location right behind the previous instruction X.

To be more scientific, a trace cache stores dynamic instruction sequences. The trace cache 'traces', or follows, the instruction sequence as the program executes. The next time the program executes that sequence of instructions, the fetch unit can fetch the 'jumping' instructions contiguously from the nicely ordered trace cache instead of the "chaotic" I-cache. The average number of instructions fetched per clockcycle will be higher, thus keeping the execution units well fed.
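A toy model of that behavior in Python (this is my sketch of the concept, not Intel's implementation):

```python
# Toy trace cache: record the dynamic order instructions actually executed in,
# so a later fetch of the same path is one contiguous read, even across
# taken branches that are scattered through the ordinary I-cache.
class TraceCache:
    def __init__(self):
        self.traces = {}  # trace start address -> executed instruction sequence

    def record(self, start_addr, executed_sequence):
        """Store instructions in execution order, not address order."""
        self.traces[start_addr] = list(executed_sequence)

    def fetch(self, start_addr):
        """Return the whole trace in one contiguous fetch, or None on a miss."""
        return self.traces.get(start_addr)

tc = TraceCache()
# The article's example: a branch jumps from X at 200 to Y at 300;
# the trace stores Y right behind X.
tc.record(200, ["X@200", "Y@300"])
print(tc.fetch(200))  # ['X@200', 'Y@300']
```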

From the Intel Developer's manual: "It reduces the pipeline bubbles that are caused by branch mispredictions where the front end of the processor has to be redirected to a new decode point."
Another interesting tidbit:

"The use of the trace cache in the Willamette processor microarchitecture eliminates the need for a superscalar decoder and removes the instruction decoder from the main execution loop".

In simpler terms, decoding becomes easier, as the trace cache makes sure that instructions are in the right sequence (grouped together in groups of independent instructions).
Last minute edit: Let Paul DeMone's more technical explanation enlighten you.

RDRAM

The quad-pumped 100 MHz bus between the chipset and the Willamette is 64 bits wide and delivers 3.2 GB/s of bandwidth. Intel also confirmed that the Willamette will be using RDRAM.

RDRAM was a good choice. Why? It is true that 800 MHz RDRAM, which features 1.6 GB/s of bandwidth, doesn't perform well when paired with a PIII and i820 chipset, but you shouldn't forget that the incredible amount of bandwidth RDRAM is capable of is severely crippled by the slow 133 MHz FSB of the PIII.

When we tested RDRAM with a 100 MHz FSB and then compared it with our readings at 133 MHz, we noticed a five to ten percent increase in overall performance (10% in CPUmark, 5% in Winstone). You can imagine what a dual RDRAM channel (2 x 16 bit) will do when it can deliver its full 3.2 GB/s to the FSB of the processor.
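The bandwidth figures quoted here follow directly from base clock x pumping factor x bus width; a quick sanity check in Python (the function name is mine):

```python
def bus_bandwidth_gbs(base_mhz: int, pumps_per_cycle: int, width_bits: int) -> float:
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return base_mhz * 1e6 * pumps_per_cycle * (width_bits // 8) / 1e9

print(bus_bandwidth_gbs(100, 4, 64))  # quad-pumped 100 MHz, 64-bit FSB: 3.2 GB/s
print(bus_bandwidth_gbs(400, 2, 16))  # one RDRAM channel (800 MHz effective): 1.6 GB/s
```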

Conclusion

The Willamette is a superpipelined processor which will, if Intel executes well, deliver higher clockspeeds than any processor before it. As indicated here at The Register, the Willamette will start at 1.3 GHz.

We have tried to point out in this article that all the clockspeed craziness comes with a price. The clock-for-clock performance of Willamette might not be much higher than the PIII's. Albert Yu, a senior VP at Intel Corporation, indicated that Willamette will be 30% faster than the Coppermine, which will reach 1 GHz in the third quarter of this year.

As the Willamette is supposed to start at 1.3 or 1.4 GHz, that could mean the Willamette is mostly faster because it boasts higher clockspeeds, not higher IPC. In other words, AMD's Thunderbird should be able to compete well with the Willamette clock for clock. Unless, that is, Intel makes sure support for the new ISSE2 instructions is superb. In that case, Intel's new chip will blow every other CPU out of the water. Intel has supported developers very well in the past, but it takes time to build up software support and rewrite applications.

Nevertheless, the Willamette introduces some very interesting concepts like double-pumped ALUs and trace caching, which greatly enhance the "x86 decoding/micro-op core" architecture. The Willamette will reach incredibly high clock speeds, and it has quite a few tricks up its sleeve to soften the problems of such a long pipeline. It will be very hard for the competition to keep up with Willamette's clockspeed. That is the future, however."



To: Epinephrine who wrote (93852)2/17/2000 9:42:00 PM
From: Process Boy
 
e - <Your unique position here (having inside information) will undoubtedly put you in uncomfortable situations like this in the future. I hope you handle them better next time.>

I understand this, and you are correct in this aspect.

Some situations are a tough call. I went ahead and handled it the way I did. It may not have been the best way, but staying silent was tough too after the initial question was repeated.

If you do follow back, I did ignore the question at the first query.

Enough. I'd like to drop this. I told him I was sorry for engaging him. But your observation does have merit. Thank you for the level response.

PB