Kapkan, let's clear this up once and for all, shall we?
I was looking at the pipeline diagrams for the P6, Netburst, and K7 micro-architectures. If you really want to know how many main core clocks (not fast or slow clocks) the core takes to process work in the front end, you should consult the pipeline diagrams.
According to the notes I have from Microprocessor Forum (sorry, I don't have the links, so you'll have to trust me), the Pentium III (P6 micro-architecture) has a 10-stage pipeline, with the first five stages representing the front end. The first two stages are for fetching the instruction, and the next three are for the decode process. That means the Pentium III decoder actually takes three clocks from beginning to end to decode an instruction. Not that this is extremely relevant, since these are stages in a pipeline, and when the pipeline is properly filled, one instruction comes out fully decoded every single clock cycle.
But let's look at the Athlon, or K7, micro-architecture. According to my notes, the Athlon has a 10/15-stage pipeline (15 if you are computing floating point instructions, but only 10 for integer). The first six stages are devoted to the front end. The first is the fetch stage, and the next three are Athlon-specific, called Scan, Align1, and Align2. Finally, there are two decode stages, EDec and IDec. That means the Athlon requires two clock cycles to decode an instruction. Again, this is irrelevant as long as the pipeline is filled, since uops will come out of the front end on every clock cycle.
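To make the throughput-versus-latency point concrete for both chips, here's a toy Python sketch of an idealized front end. This is my own illustration, not anything from the Forum notes: the K7 stage names follow what I listed above, the P6 labels are just placeholders for the 2 fetch + 3 decode stages, and stalls and branch misses are ignored completely.

    # Idealized front-end pipeline: one instruction enters per clock, each
    # instruction spends exactly one clock in each stage, nothing ever stalls.
    P6_FRONT_END = ["Fetch1", "Fetch2", "Dec1", "Dec2", "Dec3"]           # 5 stages
    K7_FRONT_END = ["Fetch", "Scan", "Align1", "Align2", "EDec", "IDec"]  # 6 stages

    def simulate(stage_names, program, clocks):
        stages = [None] * len(stage_names)   # stages[0] is the first front-end stage
        finished = []                        # (clock, instruction) pairs
        for clock in range(1, clocks + 1):
            # a new instruction enters the first stage, everyone else moves up one
            stages = [program.pop(0) if program else None] + stages[:-1]
            # whatever is now in the last stage leaves the front end this clock
            if stages[-1] is not None:
                finished.append((clock, stages[-1]))
        return finished

    for name, stage_names in (("P6", P6_FRONT_END), ("K7", K7_FRONT_END)):
        out = simulate(stage_names, [f"insn{i}" for i in range(20)], clocks=12)
        print(name, out[:4])
    # P6: insn0 leaves at clock 5; K7: insn0 leaves at clock 6 -- but after
    # that warm-up, BOTH deliver exactly one decoded instruction per clock.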
On the Netburst micro-architecture, we see that the front end is a relatively short four stages. The first two stages are called TC Next IP, which calculates the instruction pointer for the next trace cache lookup, and the next two are TC Fetch, the trace cache fetch stages. Again, if the pipeline is filled, then uops will leave the front end every clock.
The thing that is missing here is the decode phase. Where is it? Actually, every time the trace cache gets a HIT, decoding is not required at all, which saves the Pentium 4 valuable latency and makes the pipeline that much faster. In a sense, you can say that the decode stage is ZERO clock cycles for the Pentium 4 every time there is a trace cache HIT. The trace cache holds 12k uops, which I've been told translates into the HIT rate of an 8KB-16KB traditional SRAM cache; I estimate that's between 80% and 85% of the time. Therefore, roughly 4 out of 5 times, the Pentium 4 has ZERO-cycle latency in decoding instructions. That's a big improvement over the K7 or P6 micro-architectures.
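Just to put numbers on that, here's the back-of-the-envelope math in Python. The 80-85% hit rate is my own estimate from above, and since Intel has never published the cost of the miss path, the decode-clock figures in the loop are pure placeholders.

    # Average decode latency per instruction, weighted by trace cache behavior.
    # A HIT costs zero decode clocks; a MISS costs some unknown number of clocks.
    def effective_decode_latency(hit_rate, miss_decode_clocks):
        return hit_rate * 0 + (1.0 - hit_rate) * miss_decode_clocks

    for hit_rate in (0.80, 0.85):             # my estimate of the HIT rate
        for miss_clocks in (2, 4, 8):         # hypothetical MISS-path decode costs
            avg = effective_decode_latency(hit_rate, miss_clocks)
            print(f"hit rate {hit_rate:.0%}, miss cost {miss_clocks} clocks"
                  f" -> average {avg:.2f} decode clocks per instruction")
    # Even with a painfully slow 8-clock miss path, the average works out to
    # only 1.2-1.6 clocks, and most instructions pay nothing at all.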
On the other hand, every time there is a trace cache MISS, instructions have to be decoded before they can enter the pipeline. Now, the decode logic has its own pipeline that runs independently of the main pipeline. You may be right about its clock being slower, but the fact is that Intel has never specified. You can have your own guess as to how fast it really is, but arguing over it, like I said, is beside the point. The Pentium 4 only needs to decode instructions about 1/5 of the time, and really tight program loops can obviously make this even smaller. That's why optimizing code is so important.
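And here's why tight loops make it even smaller. This is a deliberately dumb model -- the trace cache is just a set of trace start addresses that have already been built, with capacity, eviction, and trace packing all ignored -- but it shows the effect:

    # Toy trace cache: the first pass through the loop misses and has to be
    # decoded; every later iteration hits and needs no decoding at all.
    def run_loop(loop_addresses, iterations):
        trace_cache = set()
        hits = misses = 0
        for _ in range(iterations):
            for addr in loop_addresses:
                if addr in trace_cache:
                    hits += 1            # uops come straight out of the trace cache
                else:
                    misses += 1          # decode once, then the trace is stored
                    trace_cache.add(addr)
        return hits, misses

    hits, misses = run_loop(loop_addresses=list(range(0x400, 0x410)), iterations=1000)
    print(f"hits={hits}  misses={misses}  miss rate={misses / (hits + misses):.2%}")
    # The 16 addresses miss only on the first iteration: 16 misses out of
    # 16,000 lookups, a 0.10% miss rate -- decoding all but disappears.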
So if you want to continue the nonsense, be my guest. I only hope that you can open your mind to the facts, rather than dwelling on apples-to-oranges comparisons to try to find excuses to knock the Pentium 4 micro-architecture.
wanna_bmw