To: Scumbria who wrote (7147) 8/31/2000 2:24:21 PM
From: EricRR

Scumbria: this is a post from Paul DeMone on Ace's. He seems to be the most fervent Willy fan out there. What do you think of his arguments?

> The P4's double-clocked ALUs, alias the Rapid Execution Engine. While it is
> quite fantastic that a 2 GHz P4, which was presented by Albert Yu at IDF,
> has a 4 GHz ALU, there is something I should point out. Integer
> multiplication takes 12 clocks to execute, while this instruction takes
> only four clocks on the PIII. If you compare the PIII with the P4, the P3
> executes the integer instructions with 50% less latency. In other words,
> the Rapid Execution Engine is only "Rapid" for the most common,
> most simple instructions (32-bit ADD). Intel's engineers have taken the
> words "make the common thing faster" to the extreme...

What is up with this? Intel changes the ALU so that Wilma can execute two dependent additions or subtractions in a little over half a nanosecond, while a hen's-tooth PIII/1000 takes two nanoseconds, and people are moaning that other instructions aren't equally sped up or may have an extra cycle or two of latency?

Perhaps a lot of people had better crack open the H&P text gathering dust on their bookshelves and have a good long hard read of the section on comparative dynamic instruction frequency. Adds/subs are not only by far the most common integer operation, but at a microarchitectural level they are essential for comparison and effective address generation.

> Isn't integer multiplication one of the more common instructions (not the
> most common, ok)?

No. It is critical for some crypto algorithms, but in the vast majority of code IMUL is used most frequently for effective address generation when dealing with arrays of records and 2+ dimensional arrays of scalars.
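To make the effective-address point concrete, here is a sketch (my own illustration, not DeMone's) of the addressing arithmetic a compiler emits for a row-major 2-D array: the `i * ncols` term is where the IMUL shows up, and inside a loop even a half-smart compiler strength reduces it to a running addition.

```python
# Row-major addressing: element (i, j) of a 2-D array lives at flat offset
# i*ncols + j.  The i*ncols multiply is the IMUL in compiled code.

def offsets_with_multiply(nrows, ncols):
    # Naive form: one multiply per element access.
    return [i * ncols + j for i in range(nrows) for j in range(ncols)]

def offsets_strength_reduced(nrows, ncols):
    # Loop form after strength reduction: the row base just advances by
    # ncols each iteration, so the IMUL becomes an ADD.
    out = []
    row_base = 0
    for _ in range(nrows):
        for j in range(ncols):
            out.append(row_base + j)
        row_base += ncols  # ADD replaces IMUL
    return out

assert offsets_with_multiply(3, 4) == offsets_strength_reduced(3, 4)
```

Both forms visit the same offsets; only the second maps onto Wilma's double-speed add/sub ALU.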
Remember, most work is done in loops, and in most loops dealing with arrays the IMUL can be strength reduced to addition (here it pops up again) by even a half-smart compiler.

> The trace cache can send only 3 micro-ops to the rest of the pipeline. Is
> that a bottleneck? I am not sure, but if one x86 instruction takes on
> average 2 micro-ops, then the Pentium 4 can only do a maximum of 1.5 x86
> instructions per clock cycle, or 1.5 IPC. Now, benchmarks show us that it
> is hard to obtain higher IPC than 1-1.1, but it seems to me that you
> should have a bit more headroom. Especially when the CPU encounters a
> piece of code that has a high amount of ILP (Instruction Level Parallelism).

The P6 averages around 0.9 IPC and about 1.4-1.5 uOPs per x86 instruction on SPECint95. This equates to 1.3 to 1.4 uOPs per cycle of sustained throughput. There is plenty of room for improvement in IPC before the 3 uOPs per cycle flowing into the front queues become the bottleneck.

> The Pentium 4 is a very interesting architecture, and the more we delve
> into it, the harder it gets to understand how it will perform in different
> applications and what trade-offs the engineers made. So feel free to
> enlighten us....

I have grown very weary of debating against the same spurious and superficial arguments for why Wilma is obviously "such a bad design" over and over again. This will be my last post anywhere on the topic until after Wilma's official introduction, when real numbers from production parts and boards are released. I will structure it as a Q&A to address all the FUD thrown at Wilma.

Q1) Willamette is so deeply pipelined compared to P6 (20 stages vs 12). Won't branch misprediction kill IPC?

The 20 vs 12 factor is greatly misleading. The static diagrams of the P6 and Wilma pipelines do show this large difference. However, the diagrams hide the "expansion joints" in both pipelines where uOPs can be stalled waiting in queues.
According to the P6 architecture manager, the EFFECTIVE number of pipestages in P6 is 15-20 for integer uOPs and about 30 for FP uOPs. We don't know how much expansion there is in Wilma's pipeline, but Intel has taken many measures to control it. The double-speed ALU helps a lot (remember: at least half of all x86 instructions perform a data memory access, and most of the EA calculations involve at least one addition). In addition, Intel has put a lot of work up front in the pipeline to intelligently schedule uOP execution later on, including eliminating uOPs whose results are dynamically determined not to be needed (uOP squashing). The biggest factor in uOP lifetime in the P6 is effective memory latency. As I will mention later on, this has improved significantly in Wilma.

Therefore I say 20:12, or whatever similar ratio is bandied about, greatly exaggerates the real average branch misprediction penalty in cycles that is actually relevant in comparing Wilma and P6 IPC. Whatever the ratio is, the branch prediction algorithm in Wilma is greatly improved (33% fewer mispredicts according to current Intel info). Although currently undisclosed, I doubt it will be any less sophisticated than the combined local/global tournament predictor in the EV6 disclosed more than 3 years ago.

Q2) Won't the 8 KB dcache in Wilma (compared to 16 KB in P6 and 64 KB in K7) really hurt performance?

No. The L1 cache is a small part of the overall memory system in an MPU. In a chip like Wilma the L1 serves primarily as a bandwidth-to-latency "transformer" to impedance match the CPU core to the L2 cache and to reduce the effective latency of L2 and memory accesses. The big performance loser is going off chip to main memory, and the 256 KB L2 in Wilma is what is relevant to that, not the 8 KB dcache. The size of an 8 KB cache is insignificant compared to the scale of the Wilma die and could easily be larger.
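Whether a small, fast L1 beats a bigger, slower one falls straight out of the standard average-memory-access-time arithmetic from H&P. A back-of-the-envelope sketch of my own: the ~92% and ~96% hit rates are the 8 KB and 32 KB figures quoted below, while the 7-cycle L1-miss penalty is purely an illustrative assumption, not an Intel number.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time in cycles: hit time plus the expected
    # cost of misses that must be serviced by the next level (L2).
    return hit_time + miss_rate * miss_penalty

L2_PENALTY = 7  # assumed L1-miss/L2-hit penalty in cycles, illustrative only

small_fast = amat(hit_time=2, miss_rate=0.08, miss_penalty=L2_PENALTY)  # ~8 KB, 2-cycle
big_slow   = amat(hit_time=3, miss_rate=0.04, miss_penalty=L2_PENALTY)  # ~32 KB, 3-cycle

# small_fast is about 2.56 cycles, big_slow about 3.28: the small cache wins.

# Break-even point: the bigger cache only pays off once the L2 penalty
# exceeds (3 - 2) / (0.08 - 0.04), about 25 cycles.
break_even = (3 - 2) / (0.08 - 0.04)
```

With an on-die L2 only a handful of cycles away, the two-cycle 8 KB cache comes out ahead unless the miss penalty climbs past roughly 25 cycles.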
I think the reason it isn't larger is that Intel wanted to hit a 2-cycle load-use penalty, and at the clock rate Wilma targets a larger cache would be a speed path. An 8 KB dcache has a hit rate of around 92% and a 32 KB cache around 96%. A two-cycle 8 KB dcache beats a much larger three-cycle dcache for the vast majority of applications, given the rest of the Wilma memory system design.

The cache info from IDF is quite interesting. The L1 dcache can perform a 128-bit wide load and store per clock cycle. According to Intel, the average cache latency of a 1.4 GHz Wilma is a little over half (55%) that of a 1 GHz PIII in ABSOLUTE TIME. On a clock-normalized basis the memory latency is only 77% of the P3's. That is right, boys and girls: a P6 memory access averages about 30% more *clock cycles* than a 1.4 GHz Wilma.

How is that possible? Well, it isn't just the smaller/faster L1. Intel borrowed a neat trick from the Alpha EV6 core. Wilma performs load data speculation in its pipeline and assumes the load hits in L1. If it doesn't, then it executes the cache fill and uses a replay trap to rerun the load. A football analogy: it is like a receiver running a downfield pattern and expecting the football over the shoulder, sight unseen. That is a lot faster than stopping and waiting to catch the football before running downfield.

Q3) Why didn't Intel beef up the FPU in Wilma more? Don't they care?

The biggest boost to FP performance in Wilma comes from more than tripling the system bus bandwidth. They also doubled the width of the data paths to and from the L1 dcache. Wilma can do a 4-way single-precision SSE2 FP MUL and ADD every two clock cycles, or 6 GFLOP/s peak at 1.5 GHz. Just the bandwidth improvement should increase clock-normalized FP performance over P6 on things like 3D rendering (at least to the bit bucket until cards catch up ;-) and SPECfp. Peak flops might impress the people Steve Jobs targets with his G4 supercomputer ads, but the pros know better.
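For what the peak figure is worth, it is just clock-times-width arithmetic. A sketch of where the 6 GFLOP/s number comes from, on my reading that "MUL/ADD every two clock cycles" means one 4-wide MUL plus one 4-wide ADD per two-cycle window:

```python
clock_hz = 1.5e9   # 1.5 GHz Wilma

mul_flops_per_2_cycles = 4  # one 4-way single-precision SSE2 MUL every 2 cycles
add_flops_per_2_cycles = 4  # one 4-way single-precision SSE2 ADD every 2 cycles

# Averaged out, that is one 4-wide operation per cycle.
flops_per_cycle = (mul_flops_per_2_cycles + add_flops_per_2_cycles) / 2

peak = clock_hz * flops_per_cycle
print(peak / 1e9)  # 6.0 (GFLOP/s)
```

Note this is a theoretical peak; as the EV6 comparison below suggests, sustained FP performance is governed far more by cache latency and memory bandwidth than by this number.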
When the Alpha designers went from the EV5 to the EV6 core, did they add to the EV5's one-FADD-pipe plus one-FMUL-pipe design? No! They stuck with the same arrangement and focused on reducing cache latency and improving memory bandwidth, just like Intel did with Wilma. Result? The EV6 has 2-2.5x higher FP performance on a clock-normalized basis than the EV5.* I am not predicting Wilma will be anywhere near as big an improvement over P6 as EV6 was over EV5. I simply wanted to point out that it is hard to spot such outcomes squinting at block diagrams counting functional units ;-)

*The EV6 is also OOO, but that generally has minor impact on FP code.

Q4) What about the benchmarks available on the net? Wilma sucks even compared to a Duron!

This is a pre-release MPU and motherboard of uncertain pedigree. We do not know for sure how fast the MPU is clocked, or which performance acceleration features inside the MPU are broken in this stepping or simply disabled. The MPU business is big business, and future product information from the dominant player is competitive information worth millions of dollars. The business interests of Intel are best served by releasing enough information to put a positive spin on the product without revealing its full potential, in order to: A) avoid hurting sales of current products (the Osborne effect), B) avoid alerting competitors, who could adjust their product development, marketing, and pricing strategies to compensate, C) stage manage expectations (always surprise on the upside), and D) keep the "holy sh*t!" buzz about the product from fading to old-news status by the time it is on the shelves in quantity.

Intel scored a monumental psychological victory 5 years ago over the RISC vendors in the way they managed the PPro introduction. But in the months leading up to it they deliberately dampened expectations about the PPro for maximum impact. Draw your own conclusions.
BTW, considering the timing recovery nuts and bolts that go into a 400 MHz system bus on both the MPU and chipset sides of things, it is likely that the bus and memory are running at full speed in these prerelease systems, so the McCalpin STREAM numbers are likely valid. The results are breathtaking and even exceed the Alpha ES40E system with the Tsunami chipset and 256-bit wide memory channels. Perhaps Direct Rambus has a role in the high-end PC after all. When Intel pulls all the chocks from under Wilma, it should really roll along.

Conclusion
----------

- The disparity in pipeline length between Wilma and P6 is exaggerated because people are looking at static diagrams, not real-life uOP lifetime data. Wilma's branch predictor will likely cover much if not most of the IPC fallout from the hyperpipelining in Wilma. EFFECT: small to mid-size minus.

- The double-speed add/sub ALU is useful both for adds/subs and for comparison and effective address generation. EFFECT: small plus.

- I didn't get into it here, but Wilma is vastly better resourced to support the number of instructions and memory accesses in simultaneous flight. One of the biggest bottlenecks in P6 was the reservation stations. Expect that to have gotten special attention. EFFECT: small plus.

- Memory latency. What can you say? Data has to get on and off chip. Wilma has more than three times the bandwidth of P6, and its access latency is nearly cut in two. EFFECT: mid-size to large plus, depending on the app.

Remember guys, about 1 in 5 instructions is a branch, and about half of those are conditional. Of the 1 in 10 instructions that are conditional branches, the P6 would mispredict 10%, or about 1 in 100 instructions, IIRC. Wilma reduces that by 33%, so only about 1 in 150 instructions is a naughty conditional branch that mispredicts. Compare that to the roughly 1 in 2 x86 instructions that perform a data memory access. That is a huge difference when it comes down to the effect on IPC.

Paul (taking vow of silence)