Noel, Pete's explanations aren't the whole story. Because traffic patterns differ from one area of the chip to another, you cannot compare the bandwidth of one area with that of another and expect the numbers to line up.
For example, he says the following:
"P4 only looks at most one instruction per cycle in the decoder. It pulls at most 3 uops per cycle from the trace cache. It allocates internal resources at 3 uops per cycle (no longer in the front). It dispatches up to 6 uops per cycle through 4 ports. It retires at most 3 uops per cycle and that is the back end. It is very unbalanced."
These observations do not make the core unbalanced. In fact, "balanced" is the wrong term, since pipeline balancing has to do with timing delays between pipeline stages, while Pete is on the subject of bandwidths.
What he fails to realize is that there are a lot of independent operations going on, and this is typical of many modern processors. A single instruction doesn't necessarily find its way from start to finish by spending one clock cycle in each pipeline stage. Often it waits in a buffer, or its result is forwarded through bypass logic, depending on its dependencies on other instructions that are also in the pipeline.
Therefore, comparing these bandwidths and expecting them to be equal is erroneous. They are unequal precisely because different areas of the chip behave differently. Pete's criticism comes from a limited knowledge of the way a CPU is supposed to work. Like others on this board, he is a self-taught techno-geek, and as much as you can learn online, I don't expect someone to have an intimate knowledge of a microprocessor like the Pentium 4. Therefore, take what he says with a grain of salt.
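To make the buffering point concrete, here is a toy sketch in Python. It is not a model of the real Pentium 4. The widths of 3, 6 and 3 uops per cycle come from Pete's quote; everything else (the queue size, the periodic dispatch stall, the function name `simulate`) is an assumption I made up purely for illustration. It compares a dispatch stage that is exactly as wide as the front end with one that is twice as wide, when dispatch is blocked for a few cycles every so often.

```python
def simulate(dispatch_width, cycles=1000, fetch_width=3, retire_width=3,
             queue_cap=48, stall_every=10, stall_len=3):
    """Toy three-stage model; returns average uops retired per cycle.

    Hypothetical assumptions: dispatch is blocked for `stall_len` cycles
    out of every `stall_every` cycles, and uops take no time to execute.
    """
    sched_queue = 0  # uops fetched but not yet dispatched
    done = 0         # uops dispatched, waiting to retire
    retired = 0
    for cycle in range(cycles):
        # Back end: retire up to retire_width finished uops.
        n = min(done, retire_width)
        done -= n
        retired += n
        # Middle: dispatch from the queue unless this is a stall cycle.
        if cycle % stall_every >= stall_len:
            n = min(sched_queue, dispatch_width)
            sched_queue -= n
            done += n
        # Front end: fetch into the queue while there is room.
        sched_queue += min(fetch_width, queue_cap - sched_queue)
    return retired / cycles


if __name__ == "__main__":
    print("dispatch width 3:", simulate(3))
    print("dispatch width 6:", simulate(6))
```

Under these made-up stall parameters, the "balanced" 3-wide dispatch settles at roughly 2.1 uops per cycle, because it never catches up on the backlog that builds during each stall. The 6-wide dispatch drains the queue in bursts and keeps the 3-wide retire stage busy almost every cycle. That is one plausible reason for a middle stage to be wider than the stages around it.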
As a matter of fact, he also says the following:
"Throughout the pipe, Athlon deals with 3 uop pairs."
In light of his other comments, this is clearly spun to be favorable towards the Athlon. In reality, a macro-op holds at most 2 uops (one pair), and how many it actually holds depends on the instruction from which it was derived. Certain instruction streams will perform much better than others on an Athlon. Also, in any given clock cycle, the decode logic can certainly dispatch fewer than 3 macro-ops. It depends on the level of ILP, or instruction-level parallelism, in the code. Not all code blocks will sustain a rate of 3 macro-ops per clock, and most will certainly be well below this peak level.
Therefore, the Athlon suffers the same fate as the Pentium 4: the inherently low level of ILP in the x86 instruction set. Application optimizations help, but one reason why Intel went with a different architecture for IA-64 was to overcome this handicap.
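The ILP point can be shown with another small Python sketch. Again, this is only an illustration under assumptions of my own: every op takes one cycle, the machine can look at all pending ops at once, and the peak width is 3 per clock (the Athlon figure from the quote). The function name `ops_per_cycle` and the `deps` format are hypothetical.

```python
def ops_per_cycle(deps, width=3):
    """Greedy issue of unit-latency ops; returns average ops issued per cycle.

    deps[i] lists the indices of earlier ops that op i must wait for.
    Assumes every dependency points to a lower index (no cycles).
    """
    if not deps:
        return 0.0
    done_cycle = {}                    # op index -> cycle in which it issued
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            # Ready only if every producer issued in an earlier cycle.
            if all(d in done_cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return len(deps) / cycle


if __name__ == "__main__":
    n = 12
    independent = [[] for _ in range(n)]
    serial_chain = [[]] + [[i - 1] for i in range(1, n)]
    two_chains = [[], []] + [[i - 2] for i in range(2, n)]
    print(ops_per_cycle(independent))   # 3.0
    print(ops_per_cycle(serial_chain))  # 1.0
    print(ops_per_cycle(two_chains))    # 2.0
```

The same 3-wide machine delivers 3, 2 or 1 ops per clock depending only on how the ops depend on each other. The peak width says nothing about what a given block of code will actually sustain.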
Let me assure you, though, that arguing with Pete is an exercise in futility. He'll never listen to an opinion that does not coincide with his own, and he is unwilling to extend the courtesy of responding to straightforward questions or reasoning. People give him a hard time because he's proven in the past that he can't carry on a two-way conversation with someone who holds a different opinion. But enough about Pete. I'll leave that subject by saying I don't agree with him, but would be happy if he shows in the future that he is willing to listen to both sides.
And as for the Pentium 4 micro-architecture, clearly sacrifices have been made which lower IPC, and this is evident from many independent tests. Finding exactly what these sacrifices are, however, will prove more difficult than simply thumbing through the micro-architectural specifications. If a layman like Pete can spot differences in logic block bandwidths, then engineers whose job is to design processors can certainly do the same. Pete's criticisms do an injustice to the design team who made a revolutionary microprocessor. The Athlon is also a revolutionary processor for AMD with respect to the K6, and should be respected as well.
All designs carry with them trade-offs and tough design decisions, but in most cases we have to trust that the people behind such decisions were informed enough to make an educated choice.
wanna_bmw