To: Tenchusatsu who wrote (93884), 2/17/2000 3:39:00 PM
From: kash johal
Tench, more speculation on the 3 GHz ALUs. This guy seems very knowledgeable to me:

Superpipelining speculation - looks intriguing if done right
Posted By: Paul DeMone <pdemone@igs.net>
Date: Wednesday, 16 February 2000, at 10:16 a.m.
In Response To: Re: clock rate uber alles (plethora)

> what do you think about the potential ability of a trace cache to increase
> IPC? depending on the trace cache hit ratio it seems like IPC should
> increase significantly. is this not the case?

Well, my understanding of trace caching is that it is like the idea of pre-decode bits taken to the extreme. Pre-decode analyzes the instruction stream during i-cache misses and adds a few bits per instruction byte to the i-cache to mark the boundaries between instructions, plus probably some info about instruction type, extensions, etc. The idea here is to do some of the hard work of decoding an x86 instruction byte stream ahead of time, to reduce the decoding work done during normal execution of code from the i-cache (and thus the average number of pipe stages).

Trace caching goes beyond that by examining the relationship between adjacent instructions during i-cache miss fetching and trying to group them together into small parcels if they can be launched simultaneously to separate functional units and have no interdependencies. This further reduces the amount of instruction decode and dependency checking the processor has to perform in normal operation when it fetches these code parcels from the i-cache and executes them. This effort is primarily directed at simplifying the execution pipeline and permitting high clock rates. The IPC issue is distinct and not necessarily tightly coupled to the use of trace cache techniques.

> the 20-stage pipeline is certainly an issue as far as branch penalties go
> and this would surely hurt IPC. but it seems logical to assume that
> Willamette will have some type of excellent branch predictor to offset
> this as much as possible. but i'm afraid i just don't know how good the
> best branch predictors are these days (or will be in a year).

This 20-stage pipeline figure may be a red herring if it includes the optional, complex, multi-cycle instruction => trace parcel conversion process. The normal instruction pipeline might be to fetch instruction parcels from the i-cache, execute them, and retire them in, say, 12 or 14 cycles. However, on an i-cache miss a separate 6- or 8-cycle sequence is inserted to perform the raw x86 fetch from L2 and the conversion into a set of parcel(s) to replace the missing line in the trace i-cache. So in tight loops or code sequences with no i-cache misses, the first 6 or 8 stages of the hypothetical 20-stage pipeline are effectively skipped. People interested in this should look up the MPR story from last November or December about the HAL SPARC V processor, which uses trace caches.

Now this double-clocked ALU has me interested. Clearly, building a 32-bit adder in an aluminum 0.18 micron process that operates in the 200 to 250 ps necessary for 3.0 GHz operation strains credibility. However, the four-cycle latency figure makes sense if you consider that the ALU is built superpipelined, with multiple stages of dynamic logic, four of them being domino-like circuits with built-in latches. So if this logic runs with a 3.0 GHz "MCLK" in a processor nominally running at 1.5 GHz, then the 4-cycle adder has 4 MCLKs (1.333 ns), less the timing overhead of four built-in latches, to perform the add, which is rather easy.
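As a quick sanity check on those numbers, here is a tiny C sketch using the clock figures from the post; the 50 ps per-stage latch overhead is purely an assumed figure for illustration, and the stage split is just the post's guess.

#include <stdio.h>

int main(void)
{
    const double mclk_hz = 3.0e9;            /* fast clock driving the ALUs ("MCLK") */
    const double pclk_hz = 1.5e9;            /* nominal core clock ("PCLK") */
    const double mclk_ps = 1e12 / mclk_hz;   /* one MCLK period, ~333 ps */
    const double pclk_ps = 1e12 / pclk_hz;   /* one PCLK period, ~667 ps */

    const int    add_mclks = 4;              /* speculated adder latency in MCLKs */
    const double latch_ps  = 50.0;           /* assumed per-stage latch/skew overhead */
    const double add_total_ps = add_mclks * mclk_ps;
    const double add_logic_ps = add_total_ps - add_mclks * latch_ps;

    printf("MCLK period        : %6.1f ps\n", mclk_ps);
    printf("PCLK period        : %6.1f ps\n", pclk_ps);
    printf("4-MCLK add, total  : %6.1f ps (%.3f ns)\n",
           add_total_ps, add_total_ps / 1000.0);
    printf("4-MCLK add, logic  : %6.1f ps left after latch overhead\n",
           add_logic_ps);

    /* the 20-stage figure read as a main pipe plus an optional
       trace-build front end that only runs on a trace cache miss */
    printf("main pipe 12-14 stages + trace build 6-8 stages = 18-22 total\n");
    return 0;
}

With those figures the 4 MCLK budget works out to 1.333 ns, which is why the "easy" characterization of the add is plausible.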
Depending on how many internal bypassing stages there are in the ALU, if any, the 4 MCLK latency may be less for consecutive arithmetic instructions issued to the same ALU. This starts to paint an interesting picture of a CPU that might fetch instruction parcels at 1.5 GHz (call it PCLK) while feeding them to superpipelined integer ALU(s) at 3.0 GHz (MCLK). So each ALU can absorb two operations per PCLK. The interesting possibility is that two dependent integer instructions can be issued to the same ALU in the same PCLK, offset by 1 MCLK, *if* the ALU is fully bypassed internally. So the code sequence:

ADDL EAX,EBX
SUBL EAX,ECX

could appear to execute simultaneously at 1.5 GHz, with a PCLK latency of 2 or 3, even though the second instruction has a RAW dependency on the first. So the effect of the "4 cycle" ALU on IPC may be positive rather than negative. Also, one or two physical ALUs would look like 2 or 4 logical ALUs, respectively, as far as instruction issue goes.

If the Willy has two physical ALUs (as I would predict), the functional unit assignment logic in the parcel builder has a strong incentive to keep dependent chains of instructions running in the same ALU (just as the issue logic in the 21264 pays close attention to reducing cross-IU-cluster data traffic). So maybe a 3 PCLK generate-use latency for crossing ALUs but *zero* latency for dual issue to the same ALU. Sweet!

More intriguing, this technique applied to the FPU could also allow Intel to implement full-rate 4x32-bit and 2x64-bit SSE2 instructions with just one 32-bit plus one 32/64-bit execution pipeline (or double issue with 2 + 2 resources). It is also no surprise that the FP EXCH instruction is as unwelcome as a dead rat at a garden party; it buggers up all this nice, albeit hypothetical, fully bypassed superpipelining.

Source: aceshardware.com
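To make the staggered-issue idea concrete, here is a toy C timeline for that ADDL/SUBL pair, assuming the 1 MCLK same-ALU bypass offset and the guessed 3 PCLK cross-ALU latency from the post above; none of these numbers are confirmed.

#include <stdio.h>

int main(void)
{
    const int mclks_per_pclk = 2;   /* double-pumped ALU: 2 MCLKs per core clock */
    const int bypass_mclk    = 1;   /* assumed same-ALU producer->consumer offset */
    const int cross_alu_pclk = 3;   /* guessed generate-use latency across ALUs */

    /* ADDL EAX,EBX issued at MCLK 0; SUBL EAX,ECX has a RAW dependency on it */
    const int add_issue_mclk = 0;
    const int sub_issue_mclk = add_issue_mclk + bypass_mclk;

    printf("ADDL EAX,EBX issued in PCLK %d (MCLK %d)\n",
           add_issue_mclk / mclks_per_pclk, add_issue_mclk);
    printf("SUBL EAX,ECX issued in PCLK %d (MCLK %d)\n",
           sub_issue_mclk / mclks_per_pclk, sub_issue_mclk);

    if (add_issue_mclk / mclks_per_pclk == sub_issue_mclk / mclks_per_pclk)
        printf("-> the dependent pair appears to issue in one 1.5 GHz cycle\n");

    printf("cross-ALU generate-use latency (guess): %d PCLK\n", cross_alu_pclk);
    return 0;
}

Both instructions land in PCLK 0 of the toy timeline, which is the "zero latency for dual issue to the same ALU" effect described above.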
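The SSE2 remark comes down to simple lane arithmetic; this sketch just multiplies the speculated pipe mix (one 32-bit pipe plus one 32/64-bit pipe, both double-pumped) out to per-clock throughput, and the pipe mix itself is only the post's speculation.

#include <stdio.h>

int main(void)
{
    const int pumps_per_pclk   = 2;  /* each physical pipe used twice per core clock */
    const int pipes_32bit_only = 1;  /* speculated: handles 32-bit lanes only */
    const int pipes_32_or_64   = 1;  /* speculated: handles 32- or 64-bit lanes */

    const int lanes32_per_pclk = (pipes_32bit_only + pipes_32_or_64) * pumps_per_pclk;
    const int lanes64_per_pclk = pipes_32_or_64 * pumps_per_pclk;

    printf("32-bit lanes per core clock: %d (enough for a 4x32 packed op)\n",
           lanes32_per_pclk);
    printf("64-bit lanes per core clock: %d (enough for a 2x64 packed op)\n",
           lanes64_per_pclk);
    return 0;
}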