Politics : Formerly About Advanced Micro Devices

To: Duncan Baird (who started this subject)
From: John Evans
Date: 8/25/2000 5:07:37 AM
 
Three points about the P4:

Twenty Stage Pipeline

There's been a lot of talk about branch prediction penalties on the P4, but what about pipeline bubbles? Specifically, does the P4 have enough functional units to sustain its 20 stages? Also, what about pipeline stalls? I don't know the average number of dependencies in twenty instructions of x86 code, but I would estimate that number to be at least one. Therefore, wouldn't the P4's 20-stage pipeline suffer from continual stalls? Out-of-order execution might help, but nevertheless, I think there's a big problem. Academic papers suggest that for RISC processors, the opportunity for instruction level parallelism (ILP) is around 6-10 instructions, and most designs (such as the Pentium, Athlon, and Alpha) are designed with this in mind. Of course, there have also been papers which contend that some code streams (typically DSP and other vectorizable code) have much higher ILP potential. Obviously, it was Intel's focus with the P4 to exploit this opportunity. I don't think that was the right strategy, however.
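To make the stall question concrete, here is a toy model in Python. It is only a sketch under loud assumptions: a single-issue, in-order pipeline with no out-of-order scheduling, where a dependent instruction cannot issue until `result_latency` cycles after its producer issued. The function name, the `deps` list, and the latency value are all hypothetical and are not taken from any Intel documentation; a real P4 would behave differently.

```python
def cycles_to_issue(deps, result_latency):
    """Toy in-order, single-issue model (an illustration, not the P4).

    deps[i] is the index of the earlier instruction that instruction i
    depends on, or None if it is independent.
    result_latency is the assumed number of cycles between a producer
    issuing and its consumer being allowed to issue.
    Returns the number of cycles needed to issue every instruction.
    """
    issue_cycle = []  # cycle at which each instruction issued
    next_free = 0     # earliest cycle the next instruction may issue
    for producer in deps:
        earliest = next_free
        if producer is not None:
            # Wait until the producer's result is assumed to be available.
            earliest = max(earliest, issue_cycle[producer] + result_latency)
        issue_cycle.append(earliest)
        next_free = earliest + 1
    return next_free


# Twenty independent instructions versus twenty with one dependency.
independent = [None] * 20
one_dependency = [None, 0] + [None] * 18  # instruction 1 needs instruction 0

print(cycles_to_issue(independent, result_latency=4))
print(cycles_to_issue(one_dependency, result_latency=4))
```

With an assumed latency of 4, the single dependency costs three bubble cycles in this model. The longer the wait between producer and consumer, the larger that bubble, which is the worry with a deep pipeline. Out-of-order execution is exactly the mechanism that would fill those cycles with later independent instructions, and this sketch deliberately leaves it out.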

NetBurst Architecture

"NetBurst architecture" is Intel's buzzword for the P4's high potential for parallelism. And, as I mentioned above, on some code the P4 should perform very well. Unfortunately, the current trend is to move that sort of processing off the CPU and onto special purpose processors. For instance, many sound cards now accelerate MP3 playback. Soon, they will also handle encoding. Graphics cards are responsible for more and more of the rendering pipeline. Granted, the first attempts at hardware-assisted transform and lighting were not very impressive. But Nvidia seems determined to push the GPU concept, and I would not be surprised if, soon, most of the DirectX library executes on a GPU, not a CPU.

Double-pumped ALU

Is this really a feature? I see it more as a kludge. As I stated above, I have some doubts about the P4's ability to feed twenty stages. Intel could have increased the number of ALUs, but that would have increased the die size, which was already too big. It appears that Intel solved this problem with the double-pump trick. But in doing so, haven't they limited the scalability of the processor? After all, for every increase in the processor's clock, the ALU clock has to increase by twice as much, and I fear the ALUs are near the limit of the current process. I imagine that on the next process shrink, Intel will drop the double-pump "feature" and add more functional units. Until then, however, I think that 2GHz is doubtful.
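The arithmetic behind that worry is simple enough to write down. This is a minimal sketch, assuming "double-pumped" means the ALU runs at exactly twice the core clock; the function name and the `pump_factor` parameter are made up for illustration.

```python
def alu_clock_ghz(core_ghz, pump_factor=2):
    """ALU clock implied by a core clock, assuming the ALU runs at
    pump_factor times the core frequency (2 for a double-pumped ALU)."""
    return core_ghz * pump_factor


# Each step up in core clock demands twice that step from the ALU.
for core in (1.5, 2.0):
    print(f"core {core} GHz -> ALU {alu_clock_ghz(core)} GHz")
```

Under that assumption, going from a 1.5GHz core to a 2GHz core means taking the ALU from 3GHz to 4GHz, so the ALU has to find a full 1GHz of headroom for a 0.5GHz gain in core clock.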

Another spurious feature is the trace cache. Again, when we consider the 20-stage pipeline, it's clear that the scheduler must aggressively re-order code. On less aggressive designs, decoding and scheduling can be done in real time. (Yes, there are exceptions, and a small cache certainly helps for tight loops.) But on the P4, I imagine the algorithms are much more complex. Therefore, it seems that Intel HAD to cache the micro-ops; otherwise the scheduler could not keep up with the pipeline. In other words, the trace cache is another so-called "feature" that would be unnecessary if not for an outrageously long pipeline.

It appears that Intel have engineered the ultimate DSP/CPU hybrid. This would be great if we needed it. But we don't. On real benchmarks, such as a Linux kernel compile, I expect a 1.4GHz P4 to perform BELOW a 1GHz P3.
