P4/PIII decoders comparison revisited.
I asked someone to run a simple loop with a single instruction
__asm CLD
unrolled 5000 times in the loop. CLD was chosen because it decodes to 4 uops, so only one of the PIII's decoders can handle it. This levels the playing field with the P4, which has only one decoder to begin with. The loop has to be unrolled at least 3000 times (4 uops * 3000 == 12,000 uops) to overwrite the trace cache, which can hold up to 12,000 uops.
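The unroll arithmetic above can be sketched as a small helper. This is only an illustration: the function name min_unroll and its parameters are made up for this example, and the 12,000-uop capacity and 4 uops per CLD are the figures quoted in the post, not values measured here.

```c
#include <assert.h>

/* Hypothetical helper: smallest number of repeated instructions whose
 * combined uops are at least as many as the trace cache can hold. */
unsigned min_unroll(unsigned trace_cache_uops, unsigned uops_per_insn)
{
    assert(uops_per_insn > 0);  /* avoid dividing by zero */
    /* Integer division rounded up. */
    return (trace_cache_uops + uops_per_insn - 1) / uops_per_insn;
}
```

With the figures from the post, min_unroll(12000, 4) gives 3000, so the 5000 repetitions used in the test are comfortably above the minimum.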
The results:
P-III-450MHz    95 M clocks
P-III-933MHz    95 M clocks
P4-1400MHz     759 M clocks
So in this test the 1.4GHz P4 needs about 8 times as many clocks as the PIII (759 M vs. 95 M) and performs no faster than a 175MHz PIII would.
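For anyone who wants to check the 175MHz figure, here is a minimal sketch of the conversion. The function name equivalent_mhz is invented for this example. It assumes the PIII clock count stays at 95 M regardless of clock speed, which is what the 450MHz and 933MHz results suggest.

```c
#include <assert.h>

/* Hypothetical helper: clock speed (MHz) at which the reference CPU would
 * finish the loop in the same wall-clock time as the tested CPU.
 * Wall time = clocks / frequency, so equal wall time means
 * ref_mhz = test_mhz * ref_clocks / test_clocks. */
double equivalent_mhz(double test_mhz, double test_clocks, double ref_clocks)
{
    assert(test_clocks > 0.0);  /* avoid dividing by zero */
    return test_mhz * ref_clocks / test_clocks;
}
```

Calling equivalent_mhz(1400.0, 759e6, 95e6) gives a little over 175, matching the number quoted above.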
Here is the C source to generate the test:
#include <stdio.h>

int main(int argc, char* argv[])
{
    int i, iterations, max_lines;

    if (argc != 4) {
        printf("invocation error: loop_asm.exe iterations max_lines instruction_string\n");
        printf("iterations: number of times the main loop is executed\n");
        printf("max_lines: number of repeated instructions_string lines\n");
        printf("instruction_string: contains \"__asm instruction __asm instruction ...\"\n");
        return 1;
    }

    sscanf(argv[1], "%d", &iterations);
    sscanf(argv[2], "%d", &max_lines);

    printf("#include <stdlib.h>\n");
    printf("#include <stdio.h>\n");
    printf("unsigned long x;\n");
    printf("#define get_stamp __asm RDTSC __asm mov [x], eax\n");
    printf("#define get_count __asm RDTSC __asm sub eax, [x] __asm mov [x], eax\n");
    printf("int i;\n");
    printf("int main() {\n");
    printf("get_stamp;\n");
    printf("for( i = 0; i < %d; i++ )\n", iterations);
    printf("{\n");
    for (i = 0; i < max_lines; i++)
        printf("%s\n", argv[3]);
    printf("}\n");
    printf("get_count;\n");
    printf("printf(\"clocks = %%d\\n\", x);\n");
    printf("return 0;\n}\n");
    return 0;
}