Technology Stocks : Intel Corporation (INTC)

To: kapkan4u who wrote (150860)
From: Joe NYC
12/5/2001 3:02:09 AM
 
Kap,

"# of cases" "code size" "P-III ticks" "P4 ticks"

1000 44k 32/32 46/45
2000 68k 69/69 133/132
3000 92k 106/103 205/199
5000 132k 190/180 372/349
10000 232k 399/369 780/716

The ratio of P4/P3 ticks for different # of cases:

1000 45/32 == 1.4
2000 132/69 == 1.375 (should be 1.913)
3000 199/103 == 1.93
5000 349/180 == 1.93
10000 716/369 == 1.94
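
The ratios can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the tick counts are copied from the table above, and since the post does not say what the two numbers in each pair are, it uses the second number of each pair, as the ratio table does. The names `TICKS` and `p4_to_p3_ratio` are made up for the illustration.

```python
# Tick counts copied from the table above, in the units posted.
# Assumption: the second number of each pair is the one to compare,
# because that is what the ratio table uses.
TICKS = {
    # cases: (P-III ticks, P4 ticks)
    1000: (32, 45),
    2000: (69, 132),
    3000: (103, 199),
    5000: (180, 349),
    10000: (369, 716),
}


def p4_to_p3_ratio(p3_ticks, p4_ticks):
    """Return how many P4 ticks were spent for every P-III tick."""
    return p4_ticks / p3_ticks


# Ratio for each case count, rounded to three decimals.
ratios = {
    cases: round(p4_to_p3_ratio(p3, p4), 3)
    for cases, (p3, p4) in TICKS.items()
}

for cases, ratio in ratios.items():
    print(f"{cases:>6}  {ratio:.3f}")
```

Run this way, the 2000-case row comes out near 1.913 rather than 1.375, in line with the other rows from 2000 cases up.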


Let me make sure I understand this (I may be a little slow). The line:
1000 44k 32/32 46/45
means that your test program had 1,000 cases in the case statement, the size of the code was 44K, the Piii needed 32 million clock ticks to finish the test, and the P4 needed 45 million. The ratio 45 / 32 is 1.4, which means that the Piii does 1.4 times as much work per clock tick as the P4.

Now, when we get to 2,000 cases and a code size of 68K (and whatever size that translates to in Trace Cache storage), the code no longer fits inside the Trace Cache, and performance drops. It drops because the CPU has to reload the code on every loop iteration, which renders the Trace Cache completely useless. All of the code has to go through the decoder, and the decoder now becomes the bottleneck.

The reason it becomes the bottleneck is that as long as the code fits inside the Trace Cache, we have one level of performance; once it no longer fits, we have another, lower level. At the higher level of performance the decoder is not used; at the lower level it is.

When the P4 operated with the Trace Cache (the code fit inside it), the efficiency of the P4 was somewhat lower than that of the Piii. As wbmw and others pointed out, the length of the P4 pipeline takes its toll, which results in the P3 being 1.4 times more efficient than the P4 (or the P4 being .711 times as efficient as the P3). Let's assume that the length of the pipeline is the bottleneck of the P4. But when we introduce a new bottleneck, the old bottleneck is no longer what constrains performance; the new one is. So for further discussion, we can set aside the old bottleneck and concentrate on the new one, the decoder.
Between 1000 and 2000 cases, the P4 switches from being a processor with a Trace Cache to a processor without one (hint: Piii).

#cases   P4 clock cycles per case, per loop iteration
1000     4.50
2000     6.60
3000     6.63
5000     6.98
10000    7.16


For comparison, let's look at Piii. Piii does not have Trace Cache. The instruction stream arriving has to be decoded. Every instruction that needs to be processed needs to be decoded first. Let's look at the performance:

#cases   Piii clock cycles per case, per loop iteration
1000     3.20
2000     3.45
3000     3.43
5000     3.60
10000    3.69
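
For anyone who wants to reproduce the per-case figures for both CPUs, here is a rough sketch. It rests on two assumptions: the tick counts are in millions, and the test loop ran 10,000 iterations. The iteration count is not stated anywhere; it is simply the value that makes the tick table agree with the per-case tables (45 million / (1000 cases x 4.50) = 10,000). `ITERATIONS` and `cycles_per_case` are illustrative names.

```python
# Assumption: the loop ran 10,000 times. The post never states this;
# it is inferred from the tick counts and the per-case figures.
ITERATIONS = 10_000

# Tick counts in millions (second number of each posted pair).
TICKS_MILLIONS = {
    # cases: (Piii, P4)
    1000: (32, 45),
    2000: (69, 132),
    3000: (103, 199),
    5000: (180, 349),
    10000: (369, 716),
}


def cycles_per_case(ticks_millions, cases, iterations=ITERATIONS):
    """Clock cycles spent on one case in one loop iteration."""
    return ticks_millions * 1_000_000 / (cases * iterations)


# (Piii, P4) cycles per case per iteration, rounded to two decimals.
per_case = {
    cases: (
        round(cycles_per_case(p3, cases), 2),
        round(cycles_per_case(p4, cases), 2),
    )
    for cases, (p3, p4) in TICKS_MILLIONS.items()
}

for cases, (p3, p4) in per_case.items():
    print(f"{cases:>6}  Piii {p3:.2f}  P4 {p4:.2f}")
```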


Performance remains fairly steady, degrading gradually (cycles per case creep up) as the probability of an L1 or L2 hit decreases.

Now, if the length of the P4 pipeline (after the decode stages) is the bottleneck of the P4, the shortness of the P3 pipeline (after the decode stages) should be an asset, and therefore not the bottleneck. The bottleneck of the Piii is the decode, and the reason Intel designed the Trace Cache into the P4 was to reduce the penalty of the decode bottleneck.

So now we have 2 processors, and we are testing them under conditions where the decode is the bottleneck of both. Let's restate the original hypothesis: the decoder of the P4 runs not at the clock speed Intel sells their CPUs at, but at half that clock speed. Under this scenario, a 2 GHz P4 is really a 1 GHz P4, with some performance and marketing enhancing tricks.

Let's look at the numbers. The number we are looking at is the # of clock cycles the CPU needs to process one case in one iteration. But which clock? The marketing clock (mc) or the decoder clock (dc)?


#cases   P4(mc)   P4(dc)   Piii
2000     6.60     3.30     3.45
3000     6.63     3.32     3.43
5000     6.98     3.49     3.60
10000    7.16     3.58     3.69
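
The P4(dc) column is just the P4(mc) column divided by two. A small sketch of that step and of how far the halved figures sit from the Piii figures follows. The values are copied from the table; `at_half_clock` and `comparison` are names invented for the example, and the code only reproduces the arithmetic, it says nothing about how the decoder is actually clocked.

```python
# Cycles per case per loop iteration, copied from the table above.
P4_MARKETING_CLOCK = {2000: 6.60, 3000: 6.63, 5000: 6.98, 10000: 7.16}
PIII = {2000: 3.45, 3000: 3.43, 5000: 3.60, 10000: 3.69}


def at_half_clock(cycles):
    """Re-express a cycle count in ticks of a clock running at half speed."""
    return cycles / 2


# For each case count: (P4 cycles at the halved clock,
#                       relative gap to the Piii figure).
comparison = {}
for cases, mc_cycles in P4_MARKETING_CLOCK.items():
    dc_cycles = at_half_clock(mc_cycles)
    gap = (PIII[cases] - dc_cycles) / PIII[cases]
    comparison[cases] = (dc_cycles, gap)

for cases, (dc_cycles, gap) in comparison.items():
    print(f"{cases:>6}  P4(dc) {dc_cycles:.3f}  Piii {PIII[cases]:.2f}  gap {gap:.1%}")
```

In every row the halved P4 figure lands within a few percent of the Piii figure.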


These numbers show that, per marketing clock tick, the decoder of the Pentium 4 does only half the work of the Piii decoder. That is because the P4 has the same decoder as the Piii, and it is running at only 1/2 of the P4's rated speed.

What does it mean? It means that buying a 2 GHz P4 gives you a 1 GHz Piii with some whiz-bang features, which in some cases take the performance beyond that of a 1 GHz Piii and in other cases do not. But the basis of the P4 processor is the old Piii decoder, which Intel can't get to run faster than 1 GHz using their 0.18 micron technology (both the Piii and the P4 hit the brick wall at 1 GHz).

Intel marketing wanted a lot of GHz, so the engineers delivered in the parts of the CPU where they could, but in some of the most critical parts they could not. Selling a P4 processor with small sections running at the marketing speed but the most critical parts running at half speed may be a clever marketing trick to some, but others may consider it to be fraud.

Joe