Joe, Re: "How can you possibly optimize your program for the size of the trace cache?"
Simple. The trace cache holds 12k uops. Applications contain structures called 'loops', which are sections of code that run over and over again. To optimize for the trace cache, make sure that the instructions in these commonly executed loops decode to no more than 12k uops. Subroutines are another such structure, and many are called regularly by applications. For the code writers: make sure that these routines do not exceed the number of instructions that would produce 12k uops. As long as you minimize the number of trace cache MISSES, your application will run a lot faster, because the decoder will rarely need to be used.
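As a rough sketch of that budget check: the 12k figure is from the post above, but the uops-per-instruction ratio below is a made-up placeholder (real code varies a lot), and the function names are hypothetical, so treat the output as a ballpark only.

```python
# Rough budget check for a hot loop or subroutine against the trace cache.
# ASSUMPTIONS: 12k uops capacity is taken from the post; the average
# uops-per-instruction ratio is a placeholder, not a measured value.
TRACE_CACHE_UOPS = 12_000
ASSUMED_UOPS_PER_INSTRUCTION = 1.5


def estimated_uops(instruction_count,
                   uops_per_instruction=ASSUMED_UOPS_PER_INSTRUCTION):
    """Estimate how many uops a block of x86 instructions decodes to."""
    return instruction_count * uops_per_instruction


def fits_in_trace_cache(instruction_count,
                        uops_per_instruction=ASSUMED_UOPS_PER_INSTRUCTION,
                        capacity_uops=TRACE_CACHE_UOPS):
    """True if the estimated uop count stays within the assumed capacity."""
    return estimated_uops(instruction_count, uops_per_instruction) <= capacity_uops


if __name__ == "__main__":
    # Hypothetical loop sizes, in x86 instructions.
    for n in (500, 4_000, 10_000):
        print(n, estimated_uops(n), fits_in_trace_cache(n))
```

In practice you would replace the assumed ratio with numbers from a profiler or from the instruction tables for the CPU you are targeting.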
Re: "you have no control what's in the trace cache, since there may be other apps running in the background, the more apps, more trace cache trashing there has to be"
From what I understand, Windows task switches between applications about every 50us, an interval that can be altered by means of priority switching. 50us is 50,000ns. For a 2GHz CPU, that's 100,000 clock cycles for the same application before it switches. IMO, that's enough to optimize for the trace cache, and have a long period of CPU time where the decode accesses can be minimized.
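The arithmetic above can be checked with a few lines. This is only a sketch: the 50us interval and the 2GHz clock are the figures quoted in this post, not measured values, and the function name is hypothetical.

```python
# Clock cycles one application gets between task switches.
# ASSUMPTIONS: 50us switch interval and 2GHz clock, both as quoted in the post.
CLOCK_HZ = 2_000_000_000   # 2GHz
SWITCH_INTERVAL_US = 50    # microseconds between task switches


def cycles_per_slice(clock_hz, interval_us):
    """Whole clock cycles that elapse during one scheduling interval."""
    return clock_hz * interval_us // 1_000_000


if __name__ == "__main__":
    interval_ns = SWITCH_INTERVAL_US * 1_000
    print(interval_ns, "ns per slice")
    print(cycles_per_slice(CLOCK_HZ, SWITCH_INTERVAL_US), "cycles per slice")
```

Note that this counts clock cycles; how many instructions actually retire in that window depends on the code being run.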
wanna_bmw