Technology Stocks : Advanced Micro Devices - Moderated (AMD)

To: Joe NYC who wrote (65046)
From: kapkan4u
Date: 12/7/2001 2:04:06 AM
 
When a case in a break-less switch is matched, its code and the code of all cases below it are executed. So the performance on this benchmark is overwhelmingly _dominated_ by the decode and execution latencies of the case code sequence.

Note that several people tried to explain away the P4's poor performance by pointing to its larger branch misprediction penalty. This is totally bogus: after the single mispredicted branch to the matched case, thousands of branchless instructions execute as control falls through the cases to the end of the switch statement.

Suppose that we are comparing a 2.0GHz P4 to a 1.0GHz PIII. The keys to comparing the decoders are:

1. All case instructions have to be 2 or more uops in length. This way the two simple decoders on the PIII will sit idle while the remaining complex decoder does all the work, just as the single P4 decoder does. This will probably favor the P4 to some extent, because the P4's decoder most likely has improved decode latencies compared to the PIII.

2. All case instructions should have execution latencies (in clock cycles) that are twice as large on the P4 as on the PIII, because P4 execution is running twice as fast (or 4 times as fast in the ALUs). We want to preserve the measure of relative decode performance and not let the P4 catch up during the execution of the decoded instructions.

3. All case instructions should be register-to-register instructions to eliminate the influence of the P4's lower-latency, higher-bandwidth L1 data and L2 caches. We can't do anything about the P4's higher L2 performance as it fetches new instructions for decode.

Kap