Politics : Formerly About Advanced Micro Devices


To: Scumbria who wrote (98716)  3/16/2000 2:33:00 PM
From: Petz
 
Scumbria, on 2x ALU. At Princeton, I had a professor, Bede Liu, whose research area was the use of asynchronous logic in computer designs. I think the Willy's ALU is pseudo-asynchronous. What I mean is this: all operations inside the ALU execute with a granularity of 1/2 a CPU clock cycle. Adding two numbers might take 1/2 a CPU clock if one of the numbers is 0, 1 CPU clock if one of the numbers is 1, 1.5 clocks in other cases. In other words the completion time might be data dependent. The ALU has a sequence of operations it has to do and it proceeds with the next operation on the same half clock cycle that it completed the last one.

If the ALU were designed in this way, it might just take a few more half-clocks to complete some operations as the CPU frequency rises to 2 GHz.
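
A minimal sketch of the guessed scheme above, purely as an illustration. The function names and the three latency classes come only from the example in this post (0.5, 1, and 1.5 clocks), not from anything Intel has disclosed.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # granularity: one half of a CPU clock


def add_latency_clocks(a: int, b: int) -> Fraction:
    """Completion time in CPU clocks under the hypothetical data-dependent scheme."""
    if a == 0 or b == 0:
        return 1 * HALF  # trivial case: one half-clock
    if a == 1 or b == 1:
        return 2 * HALF  # one full clock
    return 3 * HALF      # every other case: 1.5 clocks


def chain_latency_clocks(pairs) -> Fraction:
    """Each operation starts on the half-clock on which the previous one finished."""
    return sum((add_latency_clocks(a, b) for a, b in pairs), Fraction(0))
```

For instance, `chain_latency_clocks([(0, 5), (1, 7), (3, 4)])` adds up 0.5 + 1 + 1.5 clocks.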

I haven't had time to read any of the disclosure on Willy. Does the above description fit with the limited info that Intel has released?

Petz



To: Scumbria who wrote (98716)  3/16/2000 3:03:00 PM
From: Hans de Vries
 
Scumbria: RE: It is possible that 1/2 cycle ALU is out of balance with the rest of the pipeline, but I doubt it. However, if the ALU is the critical path, it will indeed be the limiting factor for MHz.

Clock skew is not an additional issue with the ALU, because it runs off the same clock as the rest of the design. It simply completes its operation in one phase of the clock.


The Willamette has more and shorter pipeline stages than Coppermine, which could do the slowest ALU operations in 1 cycle. This is not possible anymore. The slowest ALU operation takes about 1.5 cycles (not 0.5). The latter would mean that the ALU would be 3 times faster on the same process, which is not the case.

The Willamette is the first processor where the pipeline stages are so short that the ALU cannot perform its task within a single cycle anymore.

One of the biggest bottlenecks in an x86 processor is the ALU in case of instructions which depend on the result of the previous operation like C=A+B; E=C+D; G=E+F; et cetera.

A 1 GHz Coppermine can execute these instructions at 1 GHz.
A 1.5 GHz Willamette without 1/2 clock tricks would execute these instructions at 750 MHz (feedback after 2 cycles).
A 1.5 GHz Willamette which can feed back the result after 1.5 cycles can, however, reach 1 GHz for these instructions and uses the ALU at its maximum performance.
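
The three figures above follow from dividing the clock rate by the number of cycles before a result can be fed back into the next dependent operation. A small sketch of that arithmetic (the function name is just for illustration):

```python
def dependent_chain_rate_mhz(clock_mhz: float, feedback_cycles: float) -> float:
    """Rate at which back-to-back dependent ALU operations can issue."""
    return clock_mhz / feedback_cycles


coppermine = dependent_chain_rate_mhz(1000, 1.0)       # result fed back after 1 cycle
willamette_full = dependent_chain_rate_mhz(1500, 2.0)  # whole-cycle feedback only
willamette_half = dependent_chain_rate_mhz(1500, 1.5)  # half-cycle feedback
```

With these inputs the three rates come out to 1000, 750, and 1000 MHz respectively.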

Any processor which shortens the pipeline stage to between 67% and 99% of the ALU delay just has to use half-clock tricks if it does not want to lose IPC performance.
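
One way to see where the 67% to 99% range comes from is to round the ALU delay up to the next point at which a result can be handed back, either a whole clock or a half clock. This is only a sketch; `feedback_cycles` and its parameters are hypothetical names, and the model ignores everything except the rounding.

```python
import math
from fractions import Fraction


def feedback_cycles(stage_over_alu: Fraction, granularity: Fraction) -> Fraction:
    """Cycles until an ALU result is usable, rounded up to the given granularity.

    stage_over_alu: pipeline stage time as a fraction of the ALU delay.
    granularity:    1 for whole-clock feedback, 1/2 for half-clock feedback.
    """
    alu_cycles = 1 / stage_over_alu  # ALU delay measured in clock cycles
    return math.ceil(alu_cycles / granularity) * granularity


# Stage time at 2/3 of the ALU delay: 1.5 cycles with half clocks, 2 without.
with_half = feedback_cycles(Fraction(2, 3), Fraction(1, 2))
without_half = feedback_cycles(Fraction(2, 3), Fraction(1))
```

Below 2/3 the ALU needs more than 1.5 cycles, so the half-clock version also rounds up to 2 and the advantage disappears.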

Regards Hans.



To: Scumbria who wrote (98716)  3/16/2000 9:04:00 PM
From: Dan3
 
Re: ...the ALU, because it runs off the same clock as the rest of the design. It simply completes its operation in one phase of the clock.

Four of the stages involve pulling data from the cache; after that it takes 16 double-speed stages = 8 clock cycles to complete an integer operation (realworldtech.com).
Athlon has a 10 clock cycle integer pipeline. If it can pull data from the cache in 2 clock cycles, Athlon will have the same 8 clock cycle pipeline. The "20 stage pipeline" is either a scam, or, if there really is a lot of logic in there operating on half cycles, it may have the performance of an 8 stage pipeline with the scalability of a 9 stage pipeline. In terms of manufacturing ease and MHz scalability, Willamette's pipeline may be equal to or shorter than (and so, inferior to) Athlon's! It is almost certainly not twice as long.
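
The stage accounting in this comparison can be written out as a quick check. The numbers are the ones quoted in this post (including the assumed 2-clock cache access for Athlon); the function name is made up for illustration.

```python
def post_cache_clocks(total_stages: int, cache_stages: int, stages_per_clock: int) -> float:
    """Full clock cycles left for the integer operation once cache stages are removed."""
    return (total_stages - cache_stages) / stages_per_clock


willamette = post_cache_clocks(20, 4, 2)  # double-speed stages: two per clock
athlon = post_cache_clocks(10, 2, 1)      # assumes a 2-clock cache access
```

Both come out to 8 clocks under these assumptions.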

Now, I'm not sure I like that, because I am impressed by per clock performance, but perhaps you should reconsider your opinion of Willamette. What if AMD determines that in some cycles up to 3 transistors can change state, and they announce that Athlon actually uses a triple-pumped 30 stage pipeline? Would that make it any faster? The whole point of a deep pipeline is that it spreads out the work of completing an instruction over more clock cycles, making it easier to design a processor that can run at high clock speed. Cutting the duration of each clock in half, then doubling the number of these half-length clocks allocated to complete an instruction, leaves the same amount of time available as half as many full-length clocks. It's not going to permit any higher speed for a given quality of process. I keep bringing up this point, but you keep focusing on other aspects.
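
The time-budget point can be checked with a few lines of arithmetic. This is only a sketch: the 1.5 GHz clock is an illustrative assumption, and `time_budget_ns` is a hypothetical helper name.

```python
from fractions import Fraction


def time_budget_ns(stages: int, clock_ghz: Fraction, ticks_per_clock: int = 1) -> Fraction:
    """Total time in nanoseconds available to finish an instruction.

    ticks_per_clock=2 models "double-pumped" stages that each last half a clock.
    """
    tick_ns = 1 / (clock_ghz * ticks_per_clock)
    return stages * tick_ns


clock = Fraction(3, 2)                        # assumed 1.5 GHz, for illustration only
full_clocks = time_budget_ns(8, clock)        # 8 stages of one full clock each
half_clocks = time_budget_ns(16, clock, 2)    # 16 stages of half a clock each
triple = time_budget_ns(30, clock, 3)         # the "triple-pumped 30 stage" joke
ten_full = time_budget_ns(10, clock)          # same budget as 10 full clocks
```

Here `full_clocks` and `half_clocks` are equal, and `triple` equals `ten_full`: relabeling the stages does not change the time available.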

Sorry to go off on a Dennis Miller style rant, and maybe I'm missing something very big, but I think that until Intel releases more details on individual instruction latencies, we should be careful about drawing conclusions regarding the relevant length of the double-pumped pipeline.

Regards,

Dan