Technology Stocks : Intel Corporation (INTC)


To: Noel who wrote (144712), 10/4/2001 8:13:33 PM
From: wanna_bmw
 
Noel, Pete's explanations aren't the whole story. Because of the nature of traffic patterns in various areas of the chip, you cannot compare one area with another and expect the bandwidths to match.

For example, he says the following:

"P4 only looks at most one instruction per cycle in the decoder. It pulls at most 3 uops per cycle from the trace cache. It allocates internal resources at 3 uops per cycle (no longer in the front). It dispatches up to 6 uops per cycle through 4 ports. It retires at most 3 uops per cycle and that is the back end. It is very unbalanced."

These observations do not make the core unbalanced. In fact, "balanced" is the wrong term here, since pipeline balancing has to do with timing delays between pipeline stages, while Pete is talking about bandwidths.

What he fails to realize is that there are a lot of independent operations going on, which is typical of modern processors. A single instruction doesn't necessarily work its way from start to finish by spending one clock cycle in each pipeline stage. Often it can sit in buffers, or be forwarded through bypass logic, depending on its dependencies with the other instructions in the pipeline.

Therefore, comparing these bandwidths and expecting them to be equal is erroneous. They are unequal precisely because different areas of the chip behave differently. Pete's criticism comes from having only a limited knowledge of the way a CPU is supposed to work. Like others on this board, he is a self-taught techno-geek, and as much as you can learn online, I don't expect someone like that to have an intimate knowledge of a microprocessor like the Pentium 4. Therefore, take what he says with a grain of salt.

As a matter of fact, he also says the following.

"Throughout the pipe, Athlon deals with 3 uop pairs."

In light of his other comments, this is clearly spun to be favorable toward the Athlon. In reality, a macro-op contains at most a pair of uops, and even that depends on the nature of the instruction it was derived from. Certain instruction streams will perform much better than others on an Athlon. Also, in any given clock cycle the decode logic may well dispatch fewer than 3 macro-ops. It depends on the level of ILP, or instruction-level parallelism, in the code. Not all code blocks will sustain 3 macro-ops per clock, and most will run well below that peak level.
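To illustrate what I mean by ILP, here is a rough sketch in generic C (nothing Athlon-specific, just a hypothetical example). The first loop is a serial dependency chain, so no matter how wide the decode and dispatch hardware is, the core can only retire one of those adds per cycle. The second loop uses independent accumulators, so the scheduler actually has independent uops to issue each cycle.

#include <stdint.h>

/* Low ILP: each add depends on the previous one, so only one of
   these adds can complete per cycle regardless of machine width. */
uint32_t serial_sum(const uint32_t *a, int n)
{
    uint32_t s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];              /* s depends on the previous s */
    return s;
}

/* Higher ILP: four independent accumulators break the dependency
   chain, giving the core independent work to issue in parallel. */
uint32_t parallel_sum(const uint32_t *a, int n)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)          /* handle the leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}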

The Athlon therefore suffers from the same fate as the Pentium 4: the inherently low level of ILP in the x86 instruction set. Application optimizations help, but one reason Intel went with a different architecture for IA-64 was to overcome this handicap.

Let me assure you, though, that arguing with Pete is an exercise in futility. He'll never listen to an opinion that does not coincide with his own, and he is unwilling to extend the courtesy of responding to straightforward questioning or reasoning. People give him a hard time because he's proven in the past that he can't carry on a two-way conversation with someone who holds a different opinion. But enough about Pete. I'll leave that subject by saying I don't agree with him, but I would be happy if he shows in the future that he is willing to listen to both sides.

And as for the Pentium 4 micro-architecture, clearly sacrifices have been made which lower IPC, and this is evident from many independent tests. Finding exactly what these sacrifices are, however, will prove more difficult than simply thumbing through the micro-architectural specifications. If a layman like Pete can spot differences in logic-block bandwidths, then engineers whose job it is to design processors can certainly do the same. Pete's criticisms do an injustice to the design team who made a revolutionary microprocessor. The Athlon is also a revolutionary processor for AMD with respect to the K6, and should be respected as well.

All designs carry trade-offs and tough design decisions, but in most cases we have to trust that the people behind those decisions were informed enough to make the educated choice.

wanna_bmw



To: Noel who wrote (144712), 10/4/2001 8:13:54 PM
From: Tenchusatsu
 
Noel, the architecture of the P4 is very complicated, way too much to cover in a single post. Suffice it to say that Pete's oversimplifications are unfair and uninformed, and I'm only talking about the few that even have a grain of truth to them.

Tenchusatsu



To: Noel who wrote (144712), 10/5/2001 11:56:06 AM
From: pgerassi
 
Dear Noel:

A mispredicted branch kills the entire pipeline, from just before the retire stage all the way back to the trace cache at least (it could go about 26 stages back to the decoder, plus the number of clocks needed to load the relevant code cache line). A correctly predicted branch does nothing. The only way branches affect dispatch directly is that a branch cannot be taken in the first half of a cycle and be followed by another operation in the second half. The retirement logic can't afford to carry the bad operation if the branch is mispredicted. Thus any conditional branch forces the dispatch unit to do at most 4 uops in a cycle where one of the uops is a conditional branch.
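To illustrate why the misprediction penalty matters, here is a generic C sketch (not tied to any particular compiler or to the exact P4 stage counts above). The branch in the first version is data-dependent and hard to predict, so every miss throws away all the in-flight work; the second version computes the same result with no conditional branch to mispredict.

/* Branchy version: on random data roughly half of these branches
   mispredict, and each miss flushes the in-flight pipeline. */
int count_negative_branchy(const int *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)       /* data-dependent, hard-to-predict branch */
            count++;
    }
    return count;
}

/* Branchless version: the comparison result is used as an arithmetic
   value (0 or 1), so there is no conditional branch on the data. */
int count_negative_branchless(const int *a, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (a[i] < 0);
    return count;
}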

Many other uops "lock out" an execution port from issuing more than 1 uop. Shifts and rotates lock out execution port 1, and can further increase latency if they are multibit (Intel does not say how many bits it takes for a shift or rotate to go beyond a single cycle). A multiply or divide "locks up" execution port 0 for at least 3 additional cycles, and as many as 16 for a 32-bit multiply (even higher for a divide). The other two execution ports never dispatch more than 1 uop per cycle. Communication or DSP-type code has a lot of shifts and/or multiplies. In that kind of code, doing the operations in floating point may be quicker than in integer, which would be highly unusual for a GP CPU. In the past, it was always faster to use shifts and adds rather than multiplies for constants with a low count of "on" bits. P4 may be one of the few to break this common speed-up.
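For concreteness, this is the kind of shift-and-add trick I mean, written as a generic C sketch (whether it actually wins on a given P4 is exactly the question, so treat it as an illustration, not a benchmark):

#include <stdint.h>

/* Multiply by the constant 10 using the hardware multiplier
   (on P4 this ties up execution port 0 for several cycles). */
uint32_t times10_mul(uint32_t x)
{
    return x * 10u;
}

/* The same multiply expressed as shifts and adds: 10 = 8 + 2,
   so x*10 = (x << 3) + (x << 1).  This only pays off when the
   constant has few "on" bits and shifts are cheap. */
uint32_t times10_shift(uint32_t x)
{
    return (x << 3) + (x << 1);
}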

Pete