To: Elmer who wrote (118674) 11/21/2000 12:56:27 PM
From: pgerassi

Dear Elmer:

You great stupid fool. Have you ever heard of a linear sequential machine (LSM) in your long engineering career? They are used in practically every communication application, and Intel touts that P4 enhances the internet experience. Well, Intel must have forgotten that the internet connects nets, nets imply communications, and communications imply LSMs. RC5 is an implementation of an LSM in general-purpose CPU code.

LSMs are composed of three operations: shifts, masks, and exclusive ors. Masks are implemented with AND instructions, shifts with left and right shift and rotate instructions, and exclusive ors with parity conditional branches. Parity conditional branches cause P4 problems because they split 50/50 between taken and not taken. That defeats any branch prediction unit. The best policy is usually to assume the branch is not taken and to invalidate the speculatively executed instructions when it should have been taken. In short pipeline CPUs this usually costs nothing in execution time, because at most one or two instructions that the branch would have skipped are invalidated. In long pipeline CPUs like P4, four or more instructions must be dumped and reexecuted with different data.

Another penalty for P4 is the lack of a barrel shifter, which adds latency to the shift and rotate instructions that are common in LSMs. Since the mask (AND) comes after the shift in most implementations of LSM algorithms, the extra shift latency cannot be filled with useful code, so efficiency is lost (wasted CPU execution cycles).

LSMs are used for encryption, error checking, and other common communication tasks. The P4 design decisions make LSM algorithms execute really slowly compared to current mainstream CPUs, including the Celeron, Duron, P3, and Athlon.
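As a rough illustration of the shift, mask, and conditional exclusive-or pattern described above, here is a minimal sketch that assumes a bit-at-a-time CRC-32 as a stand-in for an LSM. The choice of CRC-32, the names `crc32_bitwise` and `POLY`, and the two counters are illustrative assumptions, not taken from RC5 or from any particular product. Python is used only to show the logic; in compiled code the `if` marks the spot where a data-dependent conditional branch would sit.

```python
# Hypothetical illustration: bit-at-a-time CRC-32 written as a
# shift / mask / conditional-XOR loop. Names are assumptions for this sketch.

POLY = 0xEDB88320  # reflected CRC-32 polynomial (the one used by zlib and Ethernet)


def crc32_bitwise(data: bytes):
    """Return (crc, taken, not_taken) for a bit-at-a-time CRC-32.

    taken / not_taken count how often the data-dependent branch
    went each way, so the split can be inspected for any input.
    """
    crc = 0xFFFFFFFF
    taken = 0
    not_taken = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            low_bit = crc & 1   # mask: isolate the bit about to be shifted out
            crc >>= 1           # shift
            if low_bit:         # branch depends on the data itself
                crc ^= POLY     # exclusive or, applied only when the bit was set
                taken += 1
            else:
                not_taken += 1
    return crc ^ 0xFFFFFFFF, taken, not_taken


if __name__ == "__main__":
    crc, taken, not_taken = crc32_bitwise(b"123456789")
    print(hex(crc), taken, not_taken)
```

For the standard check string `b"123456789"` the returned CRC is `0xCBF43926`, which matches the usual CRC-32 check value. The two counters show how evenly the branch splits for a given input; with random-looking data the direction of each branch has no pattern that a predictor could learn.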
You asked for code that makes a P4 perform more slowly than a lower clocked P3. Well, here is a commonly executed group of programs that will always be slow on a P4, even after a recompile, when compared to a Celeron (a P3 sans half the cache). So your vaunted P4 actually does run slower than a half speed Celeron on LSM based applications, even after a recompile. Rotates are not part of the SIMD integer extensions (MMX, SSE, and SSE2), which offer packed shifts but no rotate instruction. Rotates and shifts have, however, always been in x86 (rotates were already in the i8080, many years before the i8086, the first x86 CPU). Given a barrel shifter (not large in die area at these processes), perhaps P4 could gain enough ground to be faster than any Celeron that was not overclocked, but probably not enough to beat top end P3s. Blame Intel for this stupid design decision.

Another area where branch prediction is generally impossible is the AI engines used in most games, and similar applications like BSS and good, high quality, speaker independent (multiple different speakers) continuous voice recognition. Most newer games emphasize AI to give more realistic opponents, and the share of CPU cycles budgeted for it keeps rising. This kind of code causes more pipeline flushes in deep pipe designs like P4 and usually thrashes trace caches with ease. Here the slow decode of P4 really hurts. Most discussions of this topic (when they talk about it at all) note that the decode end of P4 acts like it is running at half speed, or no faster than a P3 running at half the clock. Thus, for any code that causes a lot of trace cache thrashing, P4 will act as slow as or slower than a P3 at half the clock. For example, on large and heavy AI code, a 1.5 GHz P4 will be slower than a 750 MHz P3. Thus, if the percentage of cycles spent in AI for Q4 is three times as large as for Q3, and assuming that Q3 budgets 10% of cycles to AI, P4 will run Q4 at the same speed as a P3 clocked at 2/3 the speed. At 50% AI, P4 equals P3 at 5/9 the speed.
At 100% AI, P4 equals P3 at 4/9 the speed (the 3x longer pipe really starts hurting here), based on the above assumptions (taken from a survey of current games).

Athlon and Duron will excel at the above games because they do not suffer from these implementation problems. And this is the direction that game users and designers want to go for the foreseeable future. Furthermore, the things that the P4 excels at will be moved to hardware accelerators in the near future (MPEG compression, T&L, and triangle clipping), where the marginal cost is very cheap (a few bucks).

But Elmer, you can't see this because the glasses you wear are painted black, with the view Intel wants you to see plastered onto the eyeball side. Take them off or you will fall into the abyss Intel hopes you will not see (do you really want to be a lemming, Elmer?).

Pete