Silicon Investor (SI) -- The First Internet Community

Technology Stocks : Intel Corporation (INTC)

To: kapkan4u who wrote (150860)	12/3/2001 5:56:37 PM
From: Charles Gryba

Kap, I did not have to do that. I took a timestamp before the start of the loop and another after it. At the end I cout the difference.

C

P.S. I am still waiting for P4 results.
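
The approach Charles describes (one timestamp before the loop, one after, then print the difference) could be sketched in modern C++ roughly as follows. This is only an illustration: `run_workload` is a hypothetical stand-in for the benchmark loop, and the use of `std::chrono::steady_clock` is an assumption, since the thread does not say which timer was actually used.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for the benchmark loop; not the code from the thread.
std::uint64_t run_workload(std::uint64_t iterations) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        sum += i % 7;  // arbitrary work so the loop has something to do
    }
    return sum;
}

// Takes a timestamp before and after the loop and returns the elapsed
// time in microseconds. The loop's result is written to `result` so the
// caller can use it (and so the work is not trivially discarded).
std::int64_t time_workload_us(std::uint64_t iterations, std::uint64_t& result) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    result = run_workload(iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Writing the returned value to `std::cout` gives the before/after difference mentioned in the post. A monotonic clock is used here so that system time adjustments cannot produce a negative difference.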

To: kapkan4u who wrote (150860)	12/3/2001 5:57:48 PM
From: wanna_bmw

Kap, what speed Pentium III and what speed Pentium 4? wbmw

To: kapkan4u who wrote (150860)	12/3/2001 6:46:11 PM
From: Timothy Liu

Here you go. You are quoting 'ticks' rather than the run time. So it looks like a 2G P4 is running slightly faster than a 1G P3, quite contrary to your original speculation that the 2G P4 is going to be 30-50% slower. The difference is IMO just due to processor stalls: the P4 has 20 pipeline stages, while the PIII has fewer (I don't know how many, but I remember the PPro had 10 stages).

Just my 0.02

Tim
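
Tim's distinction between ticks and run time can be illustrated with a short sketch. It assumes that the quoted ticks are core clock cycles, that the P3 runs at 1 GHz and the P4 at 2 GHz as his post states, and it is checked against the 2000-case figures (69 and 132 ticks) quoted elsewhere in this thread. The function names are hypothetical.

```cpp
#include <cassert>

// Converts a tick count into a relative run time, under the assumption
// that one tick is one core clock cycle. Units are arbitrary; only the
// comparison between the two processors matters.
double relative_run_time(double ticks, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return ticks / clock_ghz;
}

// True when the P4 finishes sooner than the P3. The default clock speeds
// (1 GHz P3, 2 GHz P4) are the ones assumed in the post.
bool p4_finishes_first(double p3_ticks, double p4_ticks,
                       double p3_ghz = 1.0, double p4_ghz = 2.0) {
    return relative_run_time(p4_ticks, p4_ghz) < relative_run_time(p3_ticks, p3_ghz);
}
```

With the 2000-case figures, the P4's 132 ticks at 2 GHz correspond to a relative run time of 66 against 69 for the P3, which matches the "slightly faster" reading.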

To: kapkan4u who wrote (150860)	12/5/2001 12:33:13 AM
From: Joe NYC

Kap,

Looking at your numbers, there is a small error in arithmetic, and the trace cache is blown already in the second case. Your line:

2000 132/69 == 1.375

should be

2000 132/69 == 1.913

It would be interesting to see the results from 0 to 2000 in increments of 100 or 200. Is Charles running the code? Can you re-run it? Just out of curiosity, not that it will add much. I just want to see the jump as the trace cache gets eliminated from the equation.

Joe

To: kapkan4u who wrote (150860)	12/5/2001 3:02:09 AM
From: Joe NYC

Kap,

"# of cases"   "code size"   "P-III ticks"   "P4 ticks"
1000           44k           32/32           46/45
2000           68k           69/69           133/132
3000           92k           106/103         205/199
5000           132k          190/180         372/349
10000          232k          399/369         780/716

The ratio of P4/P3 ticks for different # of cases:

1000     45/32   == 1.4
2000     132/69  == 1.375   (should be 1.913)
3000     199/103 == 1.93
5000     349/180 == 1.93
10000    716/369 == 1.94

Let me understand this (I may be a little slow). The line:

1000 44k 32/32 46/45

means that your test program had 1,000 cases in the case statement, the size of the code was 44K, the Piii needed 32 million clock ticks to finish the test, and the P4 needed 45 million. The ratio 45/32 is 1.4, which means that the Piii does 1.4 times as much work per clock tick as the P4.

Now, when we get to 2,000 cases and a code size of 68K (and whatever size that translates to in Trace Cache storage), the code no longer fits inside the Trace Cache, and performance drops. The reason for the drop is that the CPU has to reload the code continuously, on every loop iteration. This renders the Trace Cache completely useless. All of the code has to go through the decoder, which now becomes the bottleneck.

The reason it becomes the bottleneck is that as long as the code fits inside the Trace Cache, we have one level of performance; once it no longer fits, we have another, lower level of performance. At the higher level of performance the decoder is not used; at the lower level it is.

When the P4 operated with the Trace Cache (the code fit inside it), the efficiency of the P4 was somewhat lower than that of the Piii. As wbmw and others pointed out, the length of the P4 pipeline takes its toll, which results in the P3 being 1.4 times more efficient than the P4 (or the P4 being .711 times as efficient as the P3). Let's assume that the length of the pipeline is the bottleneck of the P4. But when we introduce a new bottleneck, the old bottleneck is no longer what constrains the performance; the new bottleneck is.
So for further discussion, we can set aside the old bottleneck and concentrate on the new one, the decoder. Between 1000 and 2000 cases, the P4 switches from being a processor with a Trace Cache to a processor without one (hint: Piii).

#cases   clock cycles per loop iteration per case
1000     4.50
2000     6.60
3000     6.63
5000     6.98
10000    7.16

For comparison, let's look at the Piii. The Piii does not have a Trace Cache, so every instruction in the arriving instruction stream has to be decoded before it can be processed. Let's look at the performance:

#cases   clock cycles per loop iteration per case
1000     3.20
2000     3.45
3000     3.43
5000     3.60
10000    3.69

It remains fairly steady, with performance declining gradually as the probability of an L1 or L2 hit decreases.

Now, if the length of the P4 pipeline (after the decode stages) is the bottleneck of the P4, the shortness of the P3 pipeline (after the decode stages) should be an asset, and therefore not the bottleneck. The bottleneck of the Piii is the decode, and the reason Intel designed the Trace Cache into the P4 was to reduce the penalty of the decode bottleneck.

So now we have 2 processors, and we are testing them under conditions where the decode is the bottleneck of both.

Let's restate the original hypothesis: the decoder of the P4 CPU runs not at the clock speed Intel sells their CPUs at, but at half that clock speed. Under this scenario, a 2 GHz P4 is really a 1 GHz P4 with some performance- and marketing-enhancing tricks.

Let's look at the numbers. The number we are looking at is the # of clock cycles the CPU needs to process one case in one iteration. But which clock? The marketing clock (mc) or the decoder clock (dc)?

#cases   P4(mc)   P4(dc)   Piii
2000     6.60     3.30     3.45
3000     6.63     3.32     3.43
5000     6.98     3.49     3.60
10000    7.16     3.58     3.69

These numbers show that, per rated clock cycle, the decoder of the Pentium 4 does only half the work of the Piii's. It is because the P4 has the same decoder as the Piii, and it is running at only 1/2 of the P4's rated speed.

What does it mean?
It means that buying a 2 GHz P4 gives you a 1 GHz Piii with some whiz-bang features, which in some cases take the performance beyond that of a 1 GHz Piii and in other cases do not. But the basis of the P4 processor is the old Piii decoder, which Intel can't get to run faster than 1 GHz using their .18 technology (both the Piii and the P4 hit the brick wall at 1 GHz).

Intel marketing wanted a lot of GHz, so the engineers delivered in the parts of the CPU where they could, but in some of the most critical parts they could not. Selling the P4 processor with small sections running at the marketing speed, but the most critical parts running at half speed, may be a clever marketing trick to some, but others may consider it to be fraud.

Joe