Politics : Formerly About Advanced Micro Devices


To: Petz who wrote (123074)8/25/2000 5:46:52 PM
From: Jim McMannis
 
Petz,
It seems to me that this P4 is such a radical, Rube Goldberg device that I was wondering...

1. Other than just being a new core, is it prone to a lot of glitches?
2. Is it likely there will be some software that will crash and burn when running on a P4 system?

Jim



To: Petz who wrote (123074)8/26/2000 12:33:19 AM
From: John Evans
 
RE: "Ali and John Evans, I don't think there are as many dependencies in the 20 stage pipeline because 20 micro-ops only represents 10 x86 instructions on average."

OK, that's a good point. We don't yet know enough about the underlying RISC core of the P4 to make detailed comparisons with the P3.

However, ten instructions is a pretty big span, especially when you consider tight calculation loops such as RC5.

RC5 is a pretty good example. The innermost loop is about 10 instructions -- adds, rotates, and XORs. At least half of these calculations are data-dependent. Also, consider that on the P4 the shift instructions won't be double-pumped (they will have high latency). Therefore the P4 pipeline will suffer not only from increased dependency stalls, but also from increased structural hazards (i.e., stalls or bubbles that result from too many instructions contending for a limited number of functional units). I think that in many cases the overhead from managing these problems will significantly detract from performance, especially when running legacy code that doesn't have a high level of implicit parallelism -- which is to say, most existing code.
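To make the dependency chain concrete, here is a sketch of one standard RC5-32 encryption round (not taken from the post; the A/B/S names are just the conventional ones from the cipher's description):

```python
MASK = 0xFFFFFFFF  # RC5-32 operates on 32-bit words

def rotl(x, s):
    """Rotate a 32-bit word left by s bits (s taken mod 32)."""
    s &= 31
    return ((x << s) | (x >> (32 - s))) & MASK

def rc5_round(A, B, S, i):
    """One RC5 encryption round. The XOR, the data-dependent rotate,
    and the add each consume the previous result, forming a serial chain."""
    A = (rotl(A ^ B, B) + S[2 * i]) & MASK      # new A depends on old A and B
    B = (rotl(B ^ A, A) + S[2 * i + 1]) & MASK  # new B depends on the new A
    return A, B
```

Every operation here needs the result of the one before it, so out-of-order hardware finds almost no independent work to overlap -- a deeper pipeline just stretches the same serial chain across more stages.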

That is my main hypothesis: as you increase the pipeline depth, you not only suffer a larger branch misprediction penalty, you also have potentially more dependency and scheduling problems. At some point, the penalty from this extra complexity negates any frequency improvement from the longer pipe. Also, consider that the silicon used to manage this complexity could have been put to better use. Take the small L1 data cache on the P4, for instance: was the data cache size sacrificed for an enlarged trace cache?
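That trade-off can be sketched with the textbook CPI model: effective CPI = base CPI + branch frequency x misprediction rate x flush penalty, where the flush penalty grows with pipeline depth. The numbers below are illustrative assumptions, not Intel data:

```python
def effective_cpi(base_cpi, branch_freq, mispredict_rate, pipe_depth):
    """Back-of-the-envelope model: every mispredicted branch
    flushes roughly one pipeline's worth of work."""
    return base_cpi + branch_freq * mispredict_rate * pipe_depth

def relative_speed(freq_ghz, cpi):
    """Instructions per nanosecond: frequency divided by cycles per instruction."""
    return freq_ghz / cpi

# Hypothetical workload: 20% branches, 10% mispredicted, base CPI 0.5.
p3 = relative_speed(1.0, effective_cpi(0.5, 0.20, 0.10, 10))  # 10-stage pipe
p4 = relative_speed(1.4, effective_cpi(0.5, 0.20, 0.10, 20))  # 20-stage pipe
```

With these assumed numbers the P4's 40% clock advantage shrinks to under 10% once misprediction flushes are charged, and a higher branch density or mispredict rate (or extra dependency stalls, which this model omits) erodes it further.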

I think that the 20% IPC loss suggested by Intel may actually be more like 30%. Thus, the P4 will perform below a 1 GHz P3 on many benchmarks.
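The arithmetic behind that claim, assuming a 1.4 GHz launch clock for the P4 (an assumption for illustration; the post names no clock speed): a 40% clock gain does not cover a 30% IPC loss.

```python
p3_perf = 1.0 * 1.00  # 1.0 GHz P3 at baseline IPC
p4_perf = 1.4 * 0.70  # hypothetical 1.4 GHz P4 with a 30% IPC loss

# about 0.98 versus 1.0: the deeper pipe's clock headroom is eaten by lost IPC
assert p4_perf < p3_perf
```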