To: peter_luc who wrote (73824)
From: pgerassi
3/7/2002 12:44:38 PM

Dear Peter Luc:

Paul DeMonde has been an Intel booster. His prediction that P4 would be the greatest thing ever, made before it came out, did not prove true. Even now, he continues the Intel party line that "The next (CPU) will be far better than any competitor's (CPU)". With him it is always "wait till next time!" Of course he is prone to continual credibility loss (if he has any left to lose).

It is just like him to denigrate AMD successes. Athlon has 10 stages and runs just fine to 1.8GHz, and now he thinks that Clawhammer needs 18 stages to get to 1.8GHz. He is soundly mistaken, since Athlon gets to 1.73GHz with just 10 stages (173MHz a stage), and adding 2 more at 0.18u would get it to 2.07GHz, assuming the result is at least as well balanced as Athlon. With 28 stages, P4 at 0.18u only got to 2GHz, or about 70MHz a stage. He must think AMD cannot do as well as it has already proven it can. Otherwise, he would have to admit how badly he and Intel were mistaken in their poor design decisions.

The real reason Intel needs those repeaters (or boosters, if you like that term better) is that they are trying to run 50% more clock on a chip twice as big. That means they need two stages to move data around where AMD needs only one. This shows that the high clock solution does not work beyond a given point, and 28 stages is far beyond that point for general purpose computing.

In addition, the decoding problems he alludes to for Hammer should tell him that P4 has an even worse decoding bottleneck than Athlon. I notice that Intel doesn't believe decoding is a problem, since they devote even fewer resources to it than AMD does. The trace cache is too small anyway for server type jobs.

All of this is understandable in retrospect, since Intel thought IA-64 would conquer all in the server space, on the theory that optimizing compilers are far easier than optimizing hardware.
They have failed at this "simple" task for the last 7 years (they have a lot of company, with IBM, TI and Motorola all trying this VLIW approach and trying to fix it with the compiler). All VLIW successes have come in restricted embedded applications, where hand tuned assembly is the norm for development. If IA-64 had succeeded, P4 would not need to do serving, and it was never designed to be a good server CPU. You can tell this from the low-way (low associativity) caches, the small cache sizes (especially L1), the low latency caches and the long cache lines, which all point to simple data streaming tasks rather than the highly interactive serving of large numbers of very different tasks that usually makes up server loads. This is why P4 gets trounced by both P3 and Athlon in server type loads. It is also why most of the benchmarks run a single task (or a small number of tasks) using P4 optimizations but not Athlon optimizations. Otherwise, P4 gets burned badly.

Hammer will only further exacerbate this problem for P4. Many Intellabees want Hammer not to be a server CPU. They refuse to consider it one because it would show the disaster named IA-64 in an even worse light. P4, with one decoder at half speed, is far worse than Athlon's 1 complex and 2 simple decoder pipelines at full speed, and is snowed under by Hammer's 3 complex symmetric decoders at full speed. (You can see that P4's decoder is the same as a P3 complex vector decoder pipeline at half speed: all stages match, and even the optimization documentation makes indirect reference to this.) DeMonde's rant should be directed at P4 long before he turns it on Hammer.

He also fails to realize that the business ends of all P3s, P4s, Athlons and Hammers are RISC based engines. The x86 "cruft" is far smaller than the IA-64 "compiler cruft", and P4's half speed, third-wide decoder pipe still can't digest it well enough. He needs to take a big step backward, shake out the Intel FUD and take a good dispassionate look at it again.
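The clock-per-stage figures quoted earlier in this post (Athlon at 1.73GHz on 10 stages, P4 at 2GHz on 28) can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes clock speed scales linearly with pipeline depth, which is the simplification the argument itself makes ("assuming it is at least as well balanced as Athlon"), and the function and variable names are made up for illustration.

```python
# Back-of-envelope check of the MHz-per-stage argument.
# ASSUMPTION: clock scales linearly with pipeline depth. That is the
# post's simplification, not a rule of real chip design.

def mhz_per_stage(clock_mhz: float, stages: int) -> float:
    """Average clock contribution of one pipeline stage, in MHz."""
    return clock_mhz / stages

def projected_clock_mhz(per_stage_mhz: float, stages: int) -> float:
    """Projected clock for a pipeline of `stages` equally balanced stages."""
    return per_stage_mhz * stages

# Figures as quoted in the post (0.18u parts).
ATHLON_CLOCK_MHZ, ATHLON_STAGES = 1730, 10
P4_CLOCK_MHZ, P4_STAGES = 2000, 28

athlon_per_stage = mhz_per_stage(ATHLON_CLOCK_MHZ, ATHLON_STAGES)
p4_per_stage = mhz_per_stage(P4_CLOCK_MHZ, P4_STAGES)

# Athlon-style design stretched from 10 to 12 stages.
athlon_12_stage_mhz = projected_clock_mhz(athlon_per_stage, ATHLON_STAGES + 2)

print(f"Athlon: {athlon_per_stage:.1f} MHz per stage")
print(f"P4:     {p4_per_stage:.1f} MHz per stage")
print(f"12-stage Athlon-style projection: {athlon_12_stage_mhz:.0f} MHz")
```

Under that linear assumption the 12-stage projection lands at roughly 2076MHz, in line with the 2.07GHz figure above, while P4 comes out a little over 71MHz a stage, close to the 70MHz quoted.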
Intel took three detours from the well known road to enhanced performance. Every one has been like that commercial where a car is running down a freeway and ends up on a dead end dirt road in some menacing woods. They are the "increase bandwidth, latency and cost" RDRAM, the "super long pipeline, data streaming, tiny cache CPU" P4 and the "simple in order VLIW hardware with a highly optimizing general purpose compiler" IA-64. AMD and the rest have gone the well understood way: the "increase bandwidth but lower latency and cost" DDR SDRAM, the "out of order, hardware optimized, well balanced, wide pipeline, large cache CPU" Athlon and the "out of order, optimized, well balanced, wider and slightly longer pipeline, compatible with x86, flattened 64 bit register set, 64 bit addressing, on die memory controllers and HT switched hardware, with normal compilers and applications" Hammer. I like the latter much more than the former, and history bears me out.

Pete