Around the time the PARAM 1000 was announced, there was a news item in clari.world.asia.india in which an Indian scientist claimed that its real throughput was only 4% of the peak. Peak performance numbers are usually estimated by multiplying single-CPU numbers by the number of CPUs, and they are basically meaningless for supercomputers.
After the nuclear tests, there was a news item (I don't remember whether I read it at Rediff or in the Indian Express) which mentioned that the PARAM was not and is not being used for the nuclear program.
It appears that PARAM is little more than a "welfare program for the elite", catering to the needs of a few bureaucratic-minded scientists who want to jet-set around the world attending conferences, seminars, symposiums, etc. As for those who evaluate such projects, the less said the better -- they are the politicians of India, and very often they are technically illiterate (and sometimes even literally illiterate! :-) )
As for the claims about "peak performance" numbers, they are not even worth the paper they are written on. In more technical language, this is how a supercomputer hardware expert explained it to me:
"Basically, the term MegaFLOPS is a very misleading and meaningless term, unless it is used in conjunction with a standard and well-defined program, traditionally known as a benchmark. In the floating-point numeric field, this is the LINPACK 100 x 100 All-FORTRAN 64-bit benchmark. In particular, there are restrictions on how the benchmark may be executed, in order that the results may qualify for inclusion in the report. No alteration of the source is permitted. Most combinations of compiler switches are permitted. The use of the term MegaFLOPS, other than in conjunction with the LINPACK benchmark (and a few other well-known benchmarks), has very little significance.
The reason is as follows: it is trivial to design an architecture that has arbitrarily high "peak" or "hypothetical" performance, if one restricts the problem domain to one of one's own choosing. I will illustrate this with an example. Consider the following machine:
D0-D63 ------------------------------------------------------------------
           |    |         |    |         |    |              |    |
           X0   Y0        X1   Y1        X2   Y2   ........   Xn   Yn
            \   /          \   /          \   /                \   /
            ALU0           ALU1           ALU2                 ALUn
              |              |              |                    |
CLK ----------Z0 ------------Z1 ------------Z2 ----- ....... ----Zn
              |              |              |                    |
In the crude diagram above, X0...Xn and Y0...Yn are 64-bit memory-mapped (say) registers, ALU0...ALUn are 64-bit ALUs that each operate on an X,Y pair in some predefined way, say multiplication, and Z0...Zn are output registers that are all clocked by a single CLK line.
We can now load X0...Xn, Y0...Yn at leisure, allow enough time for the ALUs to propagate the result to their respective outputs, and then clock CLK once. Hey presto! We've just done n 64-bit multiplications in one clock cycle (even less, actually, just the edge of CLK was sufficient for all the results to become available immediately). By increasing n, one can get arbitrarily high "MegaFLOPS".
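[An aside from me, not the expert: to make the arithmetic concrete, here is a small C sketch of the "peak MFLOPS" claim for such a machine. The clock rate and ALU counts below are arbitrary assumptions, chosen only for illustration.]

#include <stdio.h>

/* "Peak" MFLOPS for the hypothetical machine above: n ALUs each finish one
 * 64-bit multiply per clock edge, so the quoted peak is simply n times the
 * clock rate in MHz -- it says nothing about loading X and Y or using Z. */
int main(void)
{
    double clock_mhz = 50.0;                      /* assumed clock rate */
    long   n_alus[] = { 4, 64, 1024, 1048576 };   /* assumed ALU counts */

    for (int i = 0; i < 4; i++) {
        double peak_mflops = (double)n_alus[i] * clock_mhz;
        printf("n = %8ld ALUs  ->  \"peak\" = %.0f MFLOPS\n",
               n_alus[i], peak_mflops);
    }
    return 0;
}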
The main fallacy in this example is the notion of "where-independence". It is always possible to get arbitrarily high performance (even infinity), if you're not interested in doing anything with the results, and are willing to spend enough time moving the operands where you want them to be, without incurring measured benchmark time.
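[Again my own sketch, not the expert's: a toy C timing loop that commits exactly this sin -- the operand staging happens before the stopwatch starts, so the reported MFLOPS conveniently ignore the cost of moving the data into place. The array size and the element-wise multiply are arbitrary choices.]

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 1000000

static double src_x[N], src_y[N];   /* where the data actually lives */
static double x[N], y[N], z[N];     /* where the "ALUs" want it      */

int main(void)
{
    for (long i = 0; i < N; i++) { src_x[i] = (double)i; src_y[i] = 2.0 * i; }

    /* Staging the operands -- deliberately NOT counted in the timing. */
    memcpy(x, src_x, sizeof x);
    memcpy(y, src_y, sizeof y);

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)    /* the only thing we time */
        z[i] = x[i] * y[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0) secs = 1e-9;   /* guard against a zero reading */

    printf("\"measured\" = %.1f MFLOPS (staging ignored), z[N-1] = %g\n",
           N / (secs * 1.0e6), z[N - 1]);
    return 0;
}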
To avoid these pitfalls, standardized benchmarks like LINPACK enforce certain rules, in particular "where" and "when" dependence: the input is available at a spatially and temporally well-defined location (a 100 x 100 matrix in memory), and the output is similarly well-defined.
Why LINPACK? Because it has been found, through years of experience, that it is difficult to fool it merely by tricks and gimmicks like compiler optimization switches, caches, etc. Moreover, it does real computation of the kind commonly found in scientific problems, and there is a high correlation between a machine's performance on LINPACK and its performance on many common scientific problems.
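[One more aside from me: the sketch below shows in outline what a LINPACK-flavoured measurement looks like -- factor and solve a 100 x 100 dense system, time only the computation, and convert to MFLOPS with the standard operation count 2n^3/3 + 2n^2. It is written in C rather than the required FORTRAN, so it is only an illustration of the idea; its numbers would not qualify for the LINPACK report.]

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

#define N 100

static double a[N][N], b[N];

int main(void)
{
    /* Operands are placed at a well-defined location in memory ("where" and
     * "when" dependence) before the timed region starts. */
    srand(1325);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = 0; j < N; j++)
            b[i] += a[i][j];            /* exact solution is all ones */
    }

    clock_t t0 = clock();

    /* Gaussian elimination with partial pivoting, then back-substitution. */
    for (int k = 0; k < N - 1; k++) {
        int p = k;
        for (int i = k + 1; i < N; i++)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        for (int j = k; j < N; j++) {
            double t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t;
        }
        double tb = b[k]; b[k] = b[p]; b[p] = tb;
        for (int i = k + 1; i < N; i++) {
            double m = a[i][k] / a[k][k];
            for (int j = k; j < N; j++) a[i][j] -= m * a[k][j];
            b[i] -= m * b[k];
        }
    }
    for (int i = N - 1; i >= 0; i--) {
        for (int j = i + 1; j < N; j++) b[i] -= a[i][j] * b[j];
        b[i] /= a[i][i];
    }

    double secs  = (double)(clock() - t0) / CLOCKS_PER_SEC;
    if (secs <= 0.0) secs = 1e-9;       /* guard against a zero reading */
    double flops = 2.0 * N * N * N / 3.0 + 2.0 * N * N;
    printf("time = %.4f s, MFLOPS = %.2f, x[0] = %f (should be ~1)\n",
           secs, flops / (secs * 1.0e6), b[0]);
    return 0;
}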
With this little explanation out of the way, we immediately discover the fallacy in the 6 GigaFLOPS report. Was it running a standardized benchmark, like LINPACK? No. What benchmark was it actually running? Not mentioned. What is mentioned is "6 GigaFLOPS hypothetical peak performance". Not really worth even bothering with. As I demonstrated with my crude example, I can easily construct machines with arbitrarily high *peak* performance.
For real problems that help people live better lives, it is *sustained* performance that counts (more precisely, price per unit of sustained performance). The LINPACK report that I mentioned will give you a good idea of where the state of the art is today on sustained performance on compiled, well-defined code. I'll be searching for it later today, but I'll be surprised if any machine has exceeded a GigaFLOP on it. In 1988, the record was 46 MFLOPS for an SMP Cray and about 18 MFLOPS for a uniprocessor Cray. These levels have been reached by Pentium machines today (1997).
There's no indication that C-DAC has ever benchmarked their machines with the LINPACK benchmark, or, if they have, what the result for the All-FORTRAN 64-bit 100 x 100 case actually is. I'll be *very* surprised if it reaches 10 MFLOPS and fairly surprised if it crosses 3 MFLOPS. You can verify the numbers in the LINPACK report by looking for machines with the same CPUs (i860, T805)."