Re: Unless each micro-op is 0.67 bytes long, the trace cache is definitely bigger than 8K.
X86 instructions are decoded into one or more micro-ops. Hence the use of the term "micro-op", which replaces the term instruction, when referring to what comes out of the trace cache, in Intel's case, or the decoder, in AMD's case.
I don't know the mean number of micro-ops generated per instruction in a typical instruction stream, but it's clearly more than one. Harvard architecture chips almost always have same-sized instruction and data caches, so an 8K equivalent is about what one would expect from a reasonable design.
Intel claims it's the equivalent of an "8K up to 16K cache." The X86 mixed-length instruction architecture is pretty space efficient, if not always so efficient in terms of memory access performance and decoding. I suppose that if the compiler had optimized the instructions for fetching by padding everything out to 32 bytes (which is done), and the code was filled with very short, very simple instructions that usually decoded into a single micro-op, you could come up with compiler output padded out to the point that there were fewer micro-ops per byte than macro ops.
But the notion that 12K micro-ops is the same as 12K instructions (or macro ops) isn't correct. That the P4's trace cache holds the approximate equivalent of 8K instructions is probably about as close as we're going to get as a description of its relative size, and would be consistent with the design decisions that Intel has made on all of their other CPUs.
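To put rough numbers on that, here's a quick Python sketch of the arithmetic. The 12K micro-op capacity and the 8K figure come from this thread. The micro-ops-per-instruction ratios are purely illustrative assumptions, not measurements (as I said, I don't know the real mean), and the function names are made up for the example.

```python
# Figures taken from the thread: a 12K micro-op trace cache and the
# "8K" number from the subject line.
TRACE_CACHE_UOPS = 12 * 1024
EIGHT_K = 8 * 1024


def bytes_per_uop(cache_bytes, uops):
    """Bytes available per micro-op if the cache held only cache_bytes."""
    return cache_bytes / uops


def equivalent_instructions(uops, uops_per_instruction):
    """How many x86 instructions a given number of micro-ops represents."""
    return uops / uops_per_instruction


# The subject line's point: 8K bytes spread over 12K micro-ops.
print(round(bytes_per_uop(EIGHT_K, TRACE_CACHE_UOPS), 2))

# Hypothetical mean micro-ops per instruction (assumed, not measured).
for ratio in (1.0, 1.5, 2.0):
    print(ratio, int(equivalent_instructions(TRACE_CACHE_UOPS, ratio)))
```

With an assumed mean of 1.5 micro-ops per instruction, 12K micro-ops works out to 8K instructions, which is the only ratio at which the "8K equivalent" description holds exactly. A mean closer to 1.0 pushes the equivalent toward 12K, and a mean of 2.0 drops it to 6K.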