Technology Stocks : Advanced Micro Devices - Moderated (AMD)


To: Petz who wrote (5317)
Date: 8/17/2000 9:18:44 AM
From: pgerassi
 
Dear Petz:

Re: P4 IPC

The problem with the P4 is that the decode section of the pipeline cannot decode as fast as the execute section can execute. The trace cache is how the P4 makes up this difference. If the benchmark fits in L1 but not in the trace cache, the trace cache will thrash, reducing IPC further because the execute end must wait eight or more extra cycles for instructions to be decoded before it gets back up to speed. Thus something that fits in the trace cache of about 1 thousand decoded instructions (which always take up more space than the encoded x86 instructions) will get the highest IPC, say about 1.8 or so. If the code fits in L1 but thrashes the trace cache, IPC will drop to about 1.2. If the code does not fit in L1 but does fit in L2, IPC will be about 1.0. Remember that if the instructions are not already decoded, an unconditional jump followed by another unconditional jump must wait for the target address to be decoded, which could add another 8 cycles of delay. This section of the pipeline is much shorter in the P3.
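The three tiers above can be put into a rough sketch. The IPC figures and the trace cache capacity are the estimates from this post; the function name and the L1 and L2 sizes passed in are hypothetical placeholders, not measured P4 numbers.

```python
from typing import Optional

# "About 1 thousand decoded instructions", per the estimate in this post.
TRACE_CACHE_INSTRUCTIONS = 1000


def estimated_p4_ipc(decoded_instructions: int, code_bytes: int,
                     l1_bytes: int, l2_bytes: int) -> Optional[float]:
    """Return the rough IPC tier for a hot loop, following the post's guesses.

    l1_bytes and l2_bytes are caller-supplied assumptions; the post does
    not state the cache sizes.
    """
    if decoded_instructions <= TRACE_CACHE_INSTRUCTIONS:
        return 1.8  # fits in the trace cache: decode is off the critical path
    if code_bytes <= l1_bytes:
        return 1.2  # fits in L1 but thrashes the trace cache
    if code_bytes <= l2_bytes:
        return 1.0  # spills out of L1 but fits in L2
    return None     # larger than L2: the post gives no estimate


# Hypothetical cache sizes, for illustration only.
L1_BYTES = 8 * 1024
L2_BYTES = 256 * 1024

tight_loop = estimated_p4_ipc(800, 3_000, L1_BYTES, L2_BYTES)
medium_loop = estimated_p4_ipc(2_000, 6_000, L1_BYTES, L2_BYTES)
large_loop = estimated_p4_ipc(20_000, 60_000, L1_BYTES, L2_BYTES)
```

This is only a lookup of the guesses above, not a simulation; real IPC would vary continuously with how badly the trace cache thrashes.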

Business software is much more dominated by jumps and calls, and for the most part it does not follow the same path each time. This is why a faster CPU does not affect the business benchmarks as much. Speed there is dominated by video speed, disk access, and memory latency, and depends less on bandwidth, cache size (the working set always overfills the cache), and the like. Because of this, a CAS2 PC133 SDRAM chipset like the 815e will allow a P3 at, say, 800 MHz to run as fast as a P4 at 1.4 GHz with dual PC800 RDRAM, especially if the P3 has a faster disk, faster video, and more memory (the price difference will more than pay for it).
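To see why clock speed matters less here, a toy model helps: total time is CPU time plus time spent waiting on disk, video, and memory, and only the first part scales with the clock. Every number below is made up for illustration (including the equal IPC for both chips and the stall times); none of it is benchmark data.

```python
def run_time(instructions: int, ipc: float, clock_hz: int,
             stall_seconds: float) -> float:
    """Seconds to finish: CPU-bound time plus clock-independent waiting."""
    cpu_seconds = instructions / (ipc * clock_hz)
    return cpu_seconds + stall_seconds


# Hypothetical business workload: 1.4 billion instructions.
INSTRUCTIONS = 1_400_000_000

# P4 at 1.4 GHz, IPC of 1.0 (the post's estimate for code that spills to L2),
# with an assumed 9.0 s of disk, video, and memory waits.
p4_seconds = run_time(INSTRUCTIONS, 1.0, 1_400_000_000, 9.0)

# P3 at 800 MHz, assumed to have the same IPC, with an assumed 8.25 s of
# waits thanks to a faster disk, faster video, and more memory.
p3_seconds = run_time(INSTRUCTIONS, 1.0, 800_000_000, 8.25)

print(f"P4: {p4_seconds:.2f} s, P3: {p3_seconds:.2f} s")
```

With these assumed numbers the P3 spends 0.75 s longer in the CPU but gets it all back from shorter waits, so the two finish together. The point is the shape of the tradeoff, not the specific values.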

The best case for the P4 is something that fits in the trace cache and where the data can be prefetched before it is used: applications that do large amounts of FFTs, JPEG encoding, and video or image processing. It's even better if the work is mostly 4-way single-precision SIMD (SSE). Here there may be only a small difference, if any, between the P4 and P3, assuming both CPUs have the same number of SSE(2) pipes (I do not remember whether they do).

Thus, John, it is the reverse of what you are thinking. In business apps, the P4 falls lower relative to the P3 performance curve than it does in multimedia and simulation apps. Whether this holds true against the Tbird or Mustang is less certain.

Pete