Constantine, Re: "why does wbmw keep saying that the P3 has 3 decoders and the P4 one but you say they have the same decoder? Which one is right?"
Joe is incorrect. He is basing his argument on the assumption that both processors have the same decoder, but they do not. It's amazing to see such stupidity on this thread right now. Some of the AMDroids so badly want to reach their preconceived conclusion that they are now bending the facts to fit their own twisted beliefs.
The *FACT* of the matter is that the Pentium 4 has 1 complex decoder running at up to 2GHz. The Pentium III has 1 complex decoder and 2 simple decoders running at up to 1GHz. What's the difference between a "simple" and "complex" decoder? A simple decoder can decode any instruction that translates into a single uop (e.g. the ADD instruction). A complex decoder can decode any instruction. Why have two kinds? A simple decoder takes far less logic, and thus less die area.
Now consider the application. The case statements on each iteration contain an ADD instruction. Kap says the case structure is accessed through some kind of hash table, so there are some memory movement instructions, some comparison instructions, and some branch instructions. The latter two are probably combined into a compare and branch instruction. Such an instruction requires a complex decoder. The memory instructions also require a complex decoder. The ADD instructions can use the simple decoders.
So on every iteration, you will find that one or two decoders of the Pentium III can be used per clock: two if there is an ADD instruction inside the prefetch cache, and one if there is not. Since Kap's loop is predominantly based around the innermost summation statement, we can assume there are a lot of ADD instructions. Therefore, we expect two decoders (on average) to be busy on a Pentium III per clock, while only one is running on the Pentium 4.
That accounts for the discrepancy between the two cores if you ensure that there is always a trace cache miss on the Pentium 4.
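The decoder argument above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not a simulation of either chip: the function name `decode_cycles`, the strict in-order slot assignment, and the four-instruction loop body are all made up for the example, and clock speed and the trace cache are ignored entirely (i.e. it assumes every instruction goes through the decoder).

```python
def decode_cycles(stream, simple_decoders):
    """Clocks needed to decode `stream` in order, using one complex
    decoder plus `simple_decoders` simple decoders (toy model)."""
    cycles = 0
    i = 0
    while i < len(stream):
        # The complex decoder takes the next instruction, whatever its kind.
        i += 1
        # Each simple decoder can take a following single-uop instruction;
        # a complex instruction has to wait for the next clock.
        used = 0
        while used < simple_decoders and i < len(stream) and stream[i] == "simple":
            i += 1
            used += 1
        cycles += 1
    return cycles


# Hypothetical loop body: memory move, ADD, compare-and-branch, ADD.
# The real instruction mix in Kap's code is not known; this is an assumption.
loop_body = ["complex", "simple", "complex", "simple"]
stream = loop_body * 1000

p3_cycles = decode_cycles(stream, simple_decoders=2)  # 1 complex + 2 simple
p4_cycles = decode_cycles(stream, simple_decoders=0)  # 1 complex only

print("P3-style decode rate:", len(stream) / p3_cycles, "instructions/clock")
print("P4-style decode rate:", len(stream) / p4_cycles, "instructions/clock")
```

With this made-up mix the three-decoder arrangement averages two instructions per clock and the single decoder averages one, which matches the estimate above. A different mix (for example, runs of consecutive ADDs) would change the ratio.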
wbmw