Kap, Re: "large dense switch statement would always be generated as a computed branch into a table. So the inner loop has a computed branch with randomly generated target. This target can not be predicted by a branch predictor. Additionally, after the branch predictor fails, there will be a trace cache miss because the chance of a hit on this code is worse than 1 in 10. Just compare the number of statements with TC's capacity."
It would probably depend on the compiler, but I admit that you probably know more about compilers than I do.
Clever technique - you can't seem to win your argument about Pentium 4 performance, so you switch the conversation to compilers, and then bait me to argue the topic on your own grounds.
Not that it isn't fun wrestling with pigs (tm Jerry Sanders), but assuming you're right and large dense switch statements do end up compiling into tables, it sounds to me like the table itself would be data, and so would sit in the L1 data cache, not the trace cache. The trace cache holds decoded micro-ops, and in the case of a simple loop and table, I am quite confident that the application wouldn't exceed the trace cache's 12k uops.
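To make it concrete, here is a rough sketch of the kind of inner loop I think we are arguing about. The opcode names and the run() function are made up for illustration, and whether a given compiler really turns this switch into a jump table is an assumption on my part. If it does, the table of branch targets is read as data, and only the handful of case bodies have to live in the trace cache as decoded uops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, numbered densely from 0 so that a compiler
   may (assumption, compiler-dependent) emit a jump table for the switch. */
enum { OP_INC, OP_DEC, OP_DOUBLE, OP_NEGATE, OP_ZERO };

/* Inner loop with a dense switch: one dispatch per byte of "code".
   If the compiler uses a table, the table entries are loaded as data;
   the case bodies below are the code that gets decoded into uops. */
long run(const unsigned char *code, size_t n, long acc)
{
    assert(code != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        switch (code[i]) {
        case OP_INC:    acc += 1;   break;
        case OP_DEC:    acc -= 1;   break;
        case OP_DOUBLE: acc *= 2;   break;
        case OP_NEGATE: acc = -acc; break;
        case OP_ZERO:   acc = 0;    break;
        default:        break;      /* unknown opcode: ignore it */
        }
    }
    return acc;
}
```

Feed it a byte array of opcodes and a starting value, and compile with something like gcc -S to see whether your compiler actually builds a table for the switch.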
wbmw