Technology Stocks : Intel Corporation (INTC)


To: Tenchusatsu who wrote (108243), 8/24/2000 3:14:22 PM
From: pgerassi
 
Dear Tench:

You forgot to look at the history you like to spout about so much. The original presentation at the spring IDF is where I got those numbers; it can be found at:

developer.intel.com.

Adding this to the reporting available between then and the recent fall IDF burst leads to some logical conclusions. Intel has always seemed to split instruction and data caches equally in its mainstream line, and so do many others. In all cases except the P4, the instruction cache is equal to or smaller than the data cache at the same level.

A decoded P4 micro-op is something that works on the underlying 32-bit RISC core (it could be 64-bit). These decoded micro-ops have to include the addresses (immediates, the instruction's own address (how else would the core know which decoded instruction goes with which address?), and others), the operation code, and everything else present after the decode stage of any RISC CPU. There are now 12 thousand of these. There are only 128 lines of 64 bytes each in the data cache, i.e. 8 KB (512 SSE operands, 1K doubles, or 2K longs). If the instruction (trace) cache were sized normally, it should hold somewhere between 512 and 1K micro-ops, depending on how the cache is organized and how big a micro-op is.
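The sizing argument above is just arithmetic, so here is a sketch of it. The 64-byte line size and the 64-128 bit micro-op width are my assumptions for the estimate, not figures from the IDF slides:

```python
# Back-of-envelope sizing for the P4 L1 data cache discussed above.
# Assumed: 128 lines of 64 bytes each (line size is my assumption).
lines, line_bytes = 128, 64
cache_bytes = lines * line_bytes        # 8192 bytes = 8 KB
sse_operands = cache_bytes // 16        # 128-bit SSE operands -> 512
doubles = cache_bytes // 8              # 64-bit doubles -> 1K
longs = cache_bytes // 4                # 32-bit longs -> 2K

# If the trace cache were "normally sized" (equal in bytes to the data
# cache) and a decoded micro-op took 64-128 bits, it would hold:
ops_if_128bit = cache_bytes // 16       # -> 512 micro-ops
ops_if_64bit = cache_bytes // 8         # -> 1K micro-ops
print(cache_bytes, sse_operands, doubles, longs, ops_if_128bit, ops_if_64bit)
```

On those assumptions a "normal" trace cache lands at 512 to 1K micro-ops, which is why the quoted 12K figure stands out.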

Remember, Intel itself stated that the P4 would be only slightly larger than the P3. It is now more than twice as big. The sequence of the enlargement probably went something like this:

1) First spin. Double the length of the pipe and add our gee-whiz double-clocked ALU. Results show the speed is about 80% faster, but IPC suffers by 45%, so there is little gain in performance. Bottleneck: the very long misprediction penalty.

2) Second spin. Solution: add a trace cache, which lengthens the pipe to 28 stages, and add better misprediction techniques. Speed gain is now 90% and the IPC drop 40%, so more gain, but it will not be enough over the Athlon. Need more IPC. Bottleneck: the trace cache is thrashing too much.

3) Third spin. Solution: quadruple the trace cache size. Speed gain remains at 90% and the IPC drop is now 35%. Getting better, but the Athlon now bins higher; we need even more performance. Bottleneck: still the trace cache.

4) Fourth spin. Solution: quadruple the trace cache size again and, to help with trace cache misses, increase decode power. Speed gain drops to 80% (heat problems) and the IPC drop is now 25%. Now comfortably above the P3's sweet spot at 800 MHz, but most benchmarks are unfavorable at the pushed speed of 1.13 GHz. We need it right now! The die is huge, but at least we can sell these. Push it out the door, and we will keep looking at what we can take out to shrink the die.
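The spin-by-spin numbers above reduce to one formula: net performance relative to the baseline core is (clock gain) times (surviving IPC). A minimal sketch, using the percentages from my speculative list:

```python
# Net performance = (1 + speed gain) * (1 - IPC drop), both as fractions.
def net_perf(speed_gain_pct, ipc_drop_pct):
    return (1 + speed_gain_pct / 100) * (1 - ipc_drop_pct / 100)

# (speed gain %, IPC drop %) per spin, from the list above
spins = {
    "spin 1": (80, 45),   # ~0.99x: essentially no gain over baseline
    "spin 2": (90, 40),   # ~1.14x
    "spin 3": (90, 35),   # ~1.235x
    "spin 4": (80, 25),   # ~1.35x
}
for name, (s, d) in spins.items():
    print(name, round(net_perf(s, d), 3))
```

This makes the story in the list concrete: the first spin is a wash (1.8 × 0.55 is roughly 1.0), and only by the fourth spin does the design clear the baseline by a comfortable margin.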

Now this is speculative, but it agrees with the delays to the P4, fits the one data point we have, and puts the comparison demo in a better light (it used a realistic comparison rather than one against the all-stops-pulled 1.13 GHz part).

Pete