To: kapkan4u who wrote (150860)
12/5/2001 3:02:09 AM
From: Joe NYC

Kap,

"# of cases"   "code size"   "P-III ticks"   "P4 ticks"
1000           44k           32/32           46/45
2000           68k           69/69           133/132
3000           92k           106/103         205/199
5000           132k          190/180         372/349
10000          232k          399/369         780/716

The ratio of P4/P3 ticks for different # of cases:

1000     45/32     == 1.4
2000     132/69    == 1.375   (should be 1.913)
3000     199/103   == 1.93
5000     349/180   == 1.93
10000    716/369   == 1.94

Let me understand this (I may be a little slow). The line:

1000           44k           32/32           46/45

means that your test program had 1,000 cases in the case statement, the size of the code was 44K, Piii needed 32 million clock ticks to finish the test and P4 needed 45 million clock ticks. The ratio 45 / 32 is 1.4, which means that Piii is 1.4 times more efficient than P4 in terms of how much it can do per clock tick.

Now, when we get to 2,000 cases and a code size of 68K (and whatever size that translates to in Trace Cache storage), the code no longer fits inside the Trace Cache, and performance drops. The reason for the performance drop is that the CPU has to reload the code continuously, on every loop iteration. This renders the Trace Cache completely useless. All of the code has to go through the decoder, and the decoder now becomes the bottleneck.

The reason it becomes the bottleneck is that as long as the code fits inside the Trace Cache, we have one level of performance; once it no longer fits, we have another, lower level of performance. At the higher level of performance the decoder is not used; at the lower level of performance the decoder is used.

When the P4 operated with the Trace Cache (the code fit inside it), the efficiency of P4 was slightly lower than that of Piii. As wbmw and others pointed out, the length of the P4 pipeline takes its toll, which results in P3 being 1.4 times more efficient than P4 (or P4 being .711 times as efficient as P3). Let's assume that the length of the pipeline is the bottleneck of P4.
But when we introduce a new bottleneck, the old bottleneck is no longer what constrains performance; the new bottleneck is. So for further discussion, we can set aside the old bottleneck and concentrate on the new one, the decoder.

Between 1000 and 2000 cases, P4 switches from being a processor with a Trace Cache to a processor without one (hint: Piii).

#cases   clock cycles per loop iteration per case
1000     4.50
2000     6.60
3000     6.63
5000     6.98
10000    7.16

For comparison, let's look at Piii. Piii does not have a Trace Cache. The arriving instruction stream has to be decoded: every instruction that needs to be processed has to be decoded first. Let's look at the performance:

#cases   clock cycles per loop iteration per case
1000     3.20
2000     3.45
3000     3.43
5000     3.60
10000    3.69

Performance remains fairly steady, degrading gradually (the cycle count creeps up) as the probability of an L1 or L2 hit decreases.

Now if the length of the P4 pipeline (after the decode stages) is the bottleneck of P4, the shortness of the P3 pipeline (after the decode stages) should be an asset, and therefore not the bottleneck. The bottleneck of Piii is the decode, and the reason Intel designed the Trace Cache into P4 was to reduce the penalty of the decode bottleneck.

So now we have 2 processors, and we are testing them under conditions where the decode is the bottleneck of both.

Let's restate the original hypothesis: the decoder of the P4 CPU runs not at the clock speed Intel sells their CPUs at, but at 1/2 of that clock speed. Under this scenario, a 2 GHz P4 is really a 1 GHz P4, with some performance and marketing enhancing tricks.

Let's look at the numbers. The number we are looking at is the # of clock cycles the CPU needs to process one case in one iteration. But which clock? The marketing clock (mc) or the decoder clock (dc)?

#cases   P4(mc)   P4(dc)   Piii
2000     6.60     3.30     3.45
3000     6.63     3.32     3.43
5000     6.98     3.49     3.60
10000    7.16     3.58     3.69

These numbers show that, per marketing clock tick, the decoder of Pentium 4 does only half of the work of the Piii decoder.
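
For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the per-case cycle counts and the halved "decoder clock" column from the tick counts posted above. The names (SCALE, cycles_per_case, summarize) are made up for illustration. The factor of 100 is an assumption inferred from the posted figures (it is what turns 45 ticks over 1000 cases into 4.50); the actual iteration count of the test is not given here. The second number of each tick pair is used, since that is what the ratios above use.

```python
# Tick counts copied from the tables in this post (second number of each pair).
CASES = [1000, 2000, 3000, 5000, 10000]
P3_TICKS = {1000: 32, 2000: 69, 3000: 103, 5000: 180, 10000: 369}
P4_TICKS = {1000: 45, 2000: 132, 3000: 199, 5000: 349, 10000: 716}

# ASSUMPTION: scale factor inferred from the posted numbers, not a documented value.
SCALE = 100


def cycles_per_case(ticks, cases):
    """Clock cycles per loop iteration per case, on the scale used in the post."""
    return ticks * SCALE / cases


def summarize():
    """Return one (cases, P4 mc, P4 dc, Piii, P4/P3 tick ratio) tuple per row."""
    rows = []
    for n in CASES:
        p4_mc = cycles_per_case(P4_TICKS[n], n)   # counted in marketing clocks
        p4_dc = p4_mc / 2                         # hypothesis: decoder at half speed
        p3 = cycles_per_case(P3_TICKS[n], n)
        ratio = P4_TICKS[n] / P3_TICKS[n]
        rows.append((n, p4_mc, p4_dc, p3, ratio))
    return rows


if __name__ == "__main__":
    print(f"{'#cases':>7} {'P4(mc)':>7} {'P4(dc)':>7} {'Piii':>7} {'P4/P3':>7}")
    for n, p4_mc, p4_dc, p3, ratio in summarize():
        print(f"{n:>7} {p4_mc:>7.2f} {p4_dc:>7.2f} {p3:>7.2f} {ratio:>7.2f}")
```

From 2000 cases up, the halved P4 column stays within about 5% of the Piii column, which is the observation the half-speed decoder hypothesis rests on.
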
The reason P4 decodes at half the rate is that it has the same decoder as Piii, and that decoder runs at only 1/2 of the P4 rated speed.

What does it mean? It means that buying a 2 GHz P4 gives you a 1 GHz Piii with some whiz-bang features, which in some cases take the performance beyond that of a 1 GHz Piii and in other cases do not. But the basis of the P4 processor is the old Piii decoder, which Intel can't get to run faster than 1 GHz using their .18 technology (both Piii and P4 hit the brick wall at 1 GHz).

Intel marketing wanted a lot of GHz, so the engineers delivered in the parts of the CPU where they could, but in some of the most critical parts they could not.

Selling a P4 processor with small sections running at the marketing speed, but the most critical parts running at half speed, may be a clever marketing trick to some, but others may consider it to be fraud.

Joe