Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: kapkan4u who wrote (62220) | 11/5/2001 1:29:26 PM
From: combjelly | Read Replies (1) | Respond to of 275872
 
"It says "one instruction per clock" but the clock is running at half the speed. Nowhere does it say what the decode clock value is."

I think I remember a document from Intel stating that the decoder was running at 750 MHz when the system clock was 1.5 GHz. I have, however, been unable to find the link, which is why I say "I think"...
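
If that figure is right, the arithmetic is simple enough. Here's a rough sketch in Python, purely as illustration, since the 750 MHz number is unconfirmed:

# Back-of-the-envelope check, assuming the (unconfirmed) half-speed decoder figure.
core_clock_hz   = 1.5e9   # main core clock
decode_clock_hz = 750e6   # rumored decoder clock (assumption, not a confirmed Intel spec)
insns_per_decode_clock = 1

decode_bandwidth = decode_clock_hz * insns_per_decode_clock   # instructions per second
per_core_clock   = decode_bandwidth / core_clock_hz           # instructions per core clock

print(f"decode bandwidth: {decode_bandwidth / 1e6:.0f} M instructions/s")
print(f"that is {per_core_clock:.1f} instruction per core clock, sustained")
# -> 750 M instructions/s, i.e. 0.5 instruction per core clock if the rumor is true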



To: kapkan4u who wrote (62220) | 11/5/2001 2:02:26 PM
From: wanna_bmw | Read Replies (3) | Respond to of 275872
 
Kapkan, let's clear this up once and for all, shall we?

I was looking at the pipeline diagrams for the P6, Netburst, and K7 micro-architectures. If you really want to know how many main core clocks (not fast or slow clocks) the core takes to process work in the front end, you should look at the pipeline diagram.

According to the notes I have from Microprocessor Forum (sorry, I don't have the links, so you'll have to trust me), the Pentium III (P6 micro-architecture) has a 10-stage pipeline, with the first five stages making up the front end. The first two stages fetch the instruction, and the next three handle decode. That means the Pentium III decoder actually takes three clocks from beginning to end to decode an instruction. Not that this is especially relevant, since these are stages in a pipeline: when it is properly filled, one instruction is fully decoded every single clock cycle.
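
To make the latency-versus-throughput point concrete, here's a toy Python sketch of a three-stage decode pipeline. Only the stage count comes from the notes above; everything else is invented for illustration. Once the pipeline fills, one instruction finishes decoding every clock, even though each one spends three clocks inside:

# Toy model of a 3-stage decode pipeline (illustrative sketch, not Intel's actual design).
DECODE_STAGES = 3
instructions = [f"insn{i}" for i in range(8)]

pipeline = [None] * DECODE_STAGES   # pipeline[k] = instruction occupying decode stage k+1
feed = iter(instructions)
completed = []
clock = 0

while len(completed) < len(instructions):
    clock += 1
    # advance: a new instruction enters the first decode stage, the rest move down one stage
    pipeline = [next(feed, None)] + pipeline[:-1]
    done = pipeline[-1]             # whatever sits in the last stage finishes this clock
    if done is not None:
        completed.append((clock, done))

for clk, insn in completed:
    print(f"clock {clk:2d}: {insn} fully decoded")

# insn0 finishes at clock 3 (three clocks of latency), and after that one
# instruction completes every clock -- throughput of one decode per cycle.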

But let's look at the Athlon, or K7 micro-architecture. According to my notes, the Athlon has a 10/15-stage pipeline (15 if you are computing floating-point instructions, but only 10 for integer). The first six stages are devoted to the front end. The first is the fetch stage, and the next three are Athlon-specific, called Scan, Align1, and Align2. Finally, there are two decode stages, EDec and IDec. That means the Athlon requires two clock cycles to decode an instruction. Again, this is irrelevant as long as the pipeline stays filled, since uops will come out of the front end on every clock cycle.

On to the Netburst micro-architecture, where we see that the front end is a relatively short four stages. The first two stages are called TC Next IP, which calculate the instruction pointer for the next trace cache lookup, and the next two are TC Fetch, the trace cache fetch stages. Again, if the pipeline is filled, uops leave the front end every clock.
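
Putting the three front ends side by side, based only on the stage counts and names above (the P6 stage names are generic placeholders, since my notes only give counts for it):

# Front-end stages as described in the notes above (summary only; P6 names are placeholders).
front_ends = {
    "P6 (Pentium III)": ["Fetch1", "Fetch2", "Decode1", "Decode2", "Decode3"],
    "K7 (Athlon)":      ["Fetch", "Scan", "Align1", "Align2", "EDec", "IDec"],
    "Netburst (P4)":    ["TC Next IP 1", "TC Next IP 2", "TC Fetch 1", "TC Fetch 2"],
}

for name, stages in front_ends.items():
    print(f"{name:18s} {len(stages)} front-end stages: {', '.join(stages)}")

# With a full pipeline, all three deliver uops to the back end every clock; the stage
# count mainly determines how much latency the front end adds when it has to refill.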

The thing that is missing here is the decode phase. Where is it? Actually, every time the trace cache gets a HIT, decoding is not required, which saves the Pentium 4 valuable latency and makes the pipeline that much faster. In a sense, you can say that the decode stage is ZERO clock cycles for the Pentium 4 every time there is a trace cache HIT. The trace cache holds 12k uops, which I've been told translates into roughly the hit rate of an 8KB-16KB traditional SRAM cache. I estimate that's between 80% and 85% of the time. Therefore, 4 out of 5 times, the Pentium 4 has ZERO cycles of decode latency. That is a big improvement over the K7 or P6 micro-architectures.
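
The "zero decode cost most of the time" argument is just an expected-value calculation. A minimal sketch, using the 80-85% hit-rate estimate above and a made-up miss-path decode latency (Intel never published one, so the 4 below is purely a placeholder):

# Expected decode latency per instruction, given a trace cache hit rate.
MISS_DECODE_CYCLES = 4   # hypothetical latency of the x86 decode path on a miss (placeholder)

for hit_rate in (0.80, 0.85):
    expected = (1.0 - hit_rate) * MISS_DECODE_CYCLES   # hits contribute zero decode cycles
    print(f"hit rate {hit_rate:.0%}: ~{expected:.2f} decode cycles per instruction on average")

# At an 80% hit rate, only 1 instruction in 5 pays any decode latency at all.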

On the other hand, every time there is a trace cache MISS, the instruction has to be decoded again. Now, the decode logic has its own pipeline that runs independently of the main pipeline. You may be right about that clock being slower, but the fact is that Intel has never specified it. You can have your own guess as to how fast it really is, but arguing over it, like I said, is irrelevant. The Pentium 4 only needs to decode instructions about 1/5 of the time, and really tight program loops can obviously make this even smaller. That's why optimizing code is so important.
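
The tight-loop point drops out of the same model: if a loop body stays resident in the trace cache, only the first pass misses, so the decode cost gets amortized over every later iteration. Another illustrative sketch, with invented loop sizes and the same placeholder miss latency:

# Amortized decode cost for a loop whose body stays resident in the trace cache.
# Only the first iteration misses and has to be decoded; later passes hit every time.
MISS_DECODE_CYCLES = 4   # same hypothetical miss-path latency as in the sketch above

def amortized_decode_cycles(body_insns, iterations):
    total_decode = body_insns * MISS_DECODE_CYCLES   # paid once, on the first pass
    total_insns = body_insns * iterations
    return total_decode / total_insns

for iters in (1, 10, 100, 1000):
    cost = amortized_decode_cycles(body_insns=50, iterations=iters)
    print(f"{iters:5d} iterations: ~{cost:.3f} decode cycles per executed instruction")

# The more the loop repeats, the closer the effective decode overhead gets to zero.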

So if you want to continue the nonsense, be my guest. I only hope that you can open your mind to the facts, rather than dwelling on apples-to-oranges comparisons to try to find excuses for the Pentium 4 micro-architecture.

wanna_bmw