On aggregate benchmarks...
It seems to me that a much more reasonable scheme for comparing two chips (A and B) across a series of sub-benchmarks (S_1, S_2, ... S_n) would be as follows:
Decide on the relative weights (w_1, w_2, ..., w_n) you'd like each component to have (e.g., all w_i = 1).
For each S_i, compute R_i = (time A takes on S_i) / (time B takes on S_i).
Let R be the weighted *multiplicative* average of the R_i: R = exp( sum_over_i( w_i * log(R_i)) / sum_over_i( w_i ))
Then you can say: "on average, chip A takes R times as long as chip B on our aggregate benchmark".
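As a minimal sketch of the scheme (the function name and argument layout are just my own illustration, not anything standard):

    import math

    def weighted_ratio(times_a, times_b, weights=None):
        """Weighted multiplicative (geometric) average of per-test time ratios.

        times_a[i], times_b[i] -- time chip A / chip B takes on sub-benchmark S_i
        weights[i]             -- relative weight w_i (defaults to all 1s)
        Returns R such that, on average, A takes R times as long as B.
        """
        if weights is None:
            weights = [1.0] * len(times_a)
        log_sum = sum(w * math.log(a / b)
                      for a, b, w in zip(times_a, times_b, weights))
        return math.exp(log_sum / sum(weights))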
To see the nice properties, consider A and B performing identically on 7 out of 8 tests. On the 8th, A takes twice as long. With equal weights, you'd say:
"On average A takes 1.09 times as long as B", i.e. "A is 9% slower"
Which seems about right, intuitively, based on equal weights. (Note: 1.09... is the 8th root of 2)
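Plugging that 7-of-8 example into the sketch above (the actual times are made up, since only the ratios matter):

    times_a = [10, 10, 10, 10, 10, 10, 10, 20]   # A takes twice as long on test 8
    times_b = [10] * 8
    print(weighted_ratio(times_a, times_b))      # 1.0905..., the 8th root of 2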
This style of comparison also means you end up comparing 2 chips *directly*, not via a 3rd-chip baseline used to set arbitrary constants (e.g. "SysMark 20 = 700MHz Celeron", or the like -- just a made-up example).
And it allows the weights you select to actually have meaning-- it doesn't matter how long a particular task takes, since it's the ratio that gets averaged.
In short: The weighted multiplicative average of subtask time ratios would seem to produce fairly transparent, intuitive "benchmarks".
Doug
p.s. The technique can also be used to combine scores across multiple benchmarks-- something the review sites seem to lack: Just set each R_i to be the ratio of the two chips' scores on benchmark i, choose weights, and voila: "On average, chip A gets R times the score of B across our benchmark suite of Sysmark2003, Quake3, ScienceBench, and SandraThroughput." (Just make sure each component benchmark is of the same flavor ("higher = faster" or "lower = faster"), or use an inverse to change flavors.)
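A rough sketch of the p.s. (the higher_is_faster flags are my own illustrative device); the only twist is inverting the ratio for the "lower = faster" components so every R_i points the same way:

    import math

    def combined_score_ratio(scores_a, scores_b, higher_is_faster, weights=None):
        """Geometric mean of score ratios, flipping 'lower = faster' benchmarks.

        Returns R such that chip A scores R times chip B on average,
        with R > 1 meaning A comes out ahead overall.
        """
        if weights is None:
            weights = [1.0] * len(scores_a)
        log_sum = 0.0
        for a, b, hi, w in zip(scores_a, scores_b, higher_is_faster, weights):
            ratio = a / b if hi else b / a   # invert so bigger always means faster
            log_sum += w * math.log(ratio)
        return math.exp(log_sum / sum(weights))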