To: andreas_wonisch who wrote (87443)   8/23/2002 10:55:11 AM
From: Joe NYC

Andreas,

Note that Tom uses SysMark 2001 here, so it appears that both the '01 and the '02 versions are "broken". I am already looking forward to Anand's promised article on this issue...

Here is economaniac's take on this: ragingbull.lycos.com

Actually, that same page says that half the benchmark is clearly broken, since the office productivity scores don't scale at all with clock speed.

BAPCo is a very curious entity. It is essentially a creature of Intel, physically housed at Intel's corporate headquarters. The 2001 BAPCo suite differed from previous versions, which had been more or less general office productivity benchmarks built from common programs weighted by typical usage. When the Athlon came out, it benched way higher than the PIII on BAPCo 2000 for office applications. In 2001 they decided that streaming media and multimedia applications were becoming important, as was the ability to multitask. So they took the basic application suite test and ran Windows Media Encoder in the background throughout. As a result, WME performance made up fully 60% of the test results.

By sheer coincidence, Intel had provided Microsoft with the code for WME, and it was highly optimized for Intel's processors, including SSE and SSE2 instructions (but not 3DNow!, of course). It went a step further: it was specifically punitive to non-Intel processors. Before executing any SSE instructions it checked for a "GenuineIntel" processor, with the result that even when the Palomino arrived with SSE support, those instructions were not used. "Patching" the code to remove the Intel-only test improved the Palomino's SysMark score by 30-40%.

The article you cite mentions that, but dismisses it, claiming that "real applications are being used, and if those applications make use of the features provided by the P4...". The point is that in any benchmark a limited number of applications must stand in for the much wider range in common usage. Ideally, the performance of the test platforms on the benchmark components would be representative of more general performance on a wide range of similar applications, and the weighting of results would reflect typical usage. SysMark 2001 violated both of those principles. The Athlon actually outperformed both the PIII and the P4 on most media encoding tasks at the time because of its stronger FPU. WME was both highly optimized for the PIII and P4 and punitive toward AMD processors, and so represented the single most favorable comparison for Intel available. And the 60% weighting on WME in the total benchmark suite is absurd (surely you won't argue that nearly two thirds of typical computer usage is encoding MP3s), with the result that the benchmark primarily tested "is it Intel?" rather than general application performance.

After getting caught on 2001, BAPCo went another route in 2002. Once Microsoft updated WME so that it worked on the Palomino, the Athlon was once again dominating SysMark. SysMark 2002 changed the mix of components to stress bandwidth over raw CPU power. As AMD investors we were miffed, since that essentially meant choosing applications based on how much they favor the P4, but one can reasonably argue that general usage is going that way. As I understand it there is more to it than that, though, and a scandal is about to break. Apparently, rather than using a fixed weighting of applications to determine benchmark scores, something in SysMark 2002 weights each component according to the time it takes to complete.
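Going back to the WME vendor check for a second, here is a toy sketch (Python, with made-up names, nothing from the actual WME source) of the kind of dispatch being described: the fast SSE path is taken only when the processor identifies itself as "GenuineIntel", so a Palomino that supports SSE still runs the slow generic path, and removing that one test is essentially all the "patch" did.

```python
# Toy illustration (made-up names, not the actual Windows Media Encoder code)
# of a vendor-gated code path: the SSE routine is selected only when the CPU
# reports the vendor string "GenuineIntel", so an Athlon XP that supports SSE
# still falls through to the slow generic path.

def pick_encoder_path(vendor_string, has_sse):
    """Return the code path an encoder gated this way would run."""
    if vendor_string == "GenuineIntel" and has_sse:
        return "SSE-optimized loop"      # fast path, Intel only
    return "generic x87 loop"            # everyone else, even with SSE present

print(pick_encoder_path("GenuineIntel", True))    # P4: SSE-optimized loop
print(pick_encoder_path("AuthenticAMD", True))    # Palomino: generic x87 loop

# "Patching" out the Intel-only test amounts to dropping the vendor check,
# which is roughly what produced the 30-40% jump on the Palomino:
def pick_encoder_path_patched(vendor_string, has_sse):
    return "SSE-optimized loop" if has_sse else "generic x87 loop"

print(pick_encoder_path_patched("AuthenticAMD", True))   # SSE-optimized loop
```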
Weighting each component by its own completion time gives you a sort of sum-of-squares measure. For example: run ten tests, time them in seconds, square each result, and add them up. Take the inverse and multiply by some constant to get a final score that increases with performance. Such a test gives little credit for very good performance on the short components (i.e. cutting a time from 2 seconds to 1 second only improves the sum by 3 points) but heavily penalizes very bad performance on the long ones (going from 10 seconds to 15 costs 125 points). In theory such a test would reward platforms that were good across all applications over those that were very good at some and very bad at others.

We all know that currently the Athlon is much better at some things (especially number crunching) than the P4, and the P4 is much better at others (especially those that stress bandwidth). What BAPCo apparently did was make the tests that favor the Athlon short-duration tests, so that the absolute difference in completion time is small even if the Athlon is proportionately much faster. The longest tests (in terms of time consumed on both platforms) are almost pure bandwidth tests, which strongly favor the P4. The result is that BAPCo has shifted to a mix of components that favors the P4, and has essentially thrown out any results that don't favor the P4 while multiplying many-fold the ones that do.

When hardware sites started benching overclocked P4s, they saw something bizarre: SysMark scores scaled better than clock speed. In fact, overclocking the 2.4 GHz P4 on the 400 MHz bus to 3 GHz on the 533 MHz bus (a straight 25% clock speed bump, which should have achieved about a 10% increase in real performance) increased the SysMark score by 98%! Now we know why. The clock speed and FSB bump probably reduced the test time in the longest components by almost exactly 25%. Since the times are weighted in the final score by their own duration, the weights on those times were also cut by nearly 25%, so their contribution to the weighted sum was cut by nearly half. And since the reported score scales as the inverse of that weighted sum, and the other components (the ones that favor the Athlon) don't contribute meaningfully to the total test time, the reported score doubled.

Here is the real kicker. Hammer will have a lot more bandwidth to memory and I/O than the P4, and will also have SSE optimizations. It might score four times higher on SysMark 2002. Want to bet that something altogether different and overlooked in the past will turn out to be critical for users in 2003, so that SysMark will move away from bandwidth-sensitive tests and toward tests that emphasize, say, Hyper-Threading?
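To put numbers on the overclocking point above, here is a toy calculation (made-up component times, not the real SysMark 2002 suite) of a duration-weighted score: speeding up only the two long, bandwidth-bound components by a bit more than a quarter nearly doubles the final number, while the short components barely register.

```python
# Toy calculation (made-up component times, not the real SysMark 2002 suite)
# showing how weighting each time by itself produces an inverse sum of squares,
# and why speeding up only the long, bandwidth-bound components nearly doubles
# the reported score.

def sysmark_like_score(times_sec, constant=10_000.0):
    """Score = constant / sum(t * t): each component time is weighted by itself."""
    return constant / sum(t * t for t in times_sec)

# Hypothetical suite: two short CPU-bound tests (the kind that favor Athlon)
# and two long bandwidth-bound tests (the kind that favor P4).
p4_stock = [2.0, 3.0, 40.0, 50.0]   # seconds at 2.4 GHz / 400 MHz FSB
p4_oc    = [1.7, 2.5, 29.0, 36.0]   # ~28% faster on the long tests at 3 GHz / 533

base = sysmark_like_score(p4_stock)
oc = sysmark_like_score(p4_oc)
print(f"stock: {base:.2f}   overclocked: {oc:.2f}   gain: {100 * (oc / base - 1):.0f}%")
# The short tests contribute 2^2 + 3^2 = 13 out of ~4113 in weighted time, i.e.
# essentially nothing; cut the long tests by ~28% and the score jumps ~92%,
# the same better-than-clock-speed scaling the overclocked P4 results showed.
```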