To: dougSF30 who wrote (225472) 2/7/2007 4:21:03 AM
From: DDB_WO

Doug - I made a short list of IPC improvements we'll see coming with K10 ("Barcelona"). But first let me answer your points:

1. AMD won't have any 120W parts this summer, if DT is correct. They show them entering production in Q3, so I wouldn't expect them before October. That's something where we have to rely on sites like DT.

2. That aside, Intel has a 3GHz Clovertown they can deploy whenever they choose. This is also a question of production volume. If AMD adopts the power-virus-based TDP definition, then they'll yield some more parts fitting into this new envelope.

3. Finally, I expect that Core 2 will maintain an integer advantage clock for clock. Combined with a ~20% clock advantage, one can understand why Otellini is comfortable stating that Intel will maintain performance leadership. (And late this year, the Penryn 45nm upgrades arrive, followed by the Nehalem death-blow in H2'08.)

We don't know what Otellini knows. But different motivations might lead to the same behaviour (emotional intelligence). One could be that Otellini liked the market share gains in the server market, so he doesn't want to let his customers dive into uncertainty.

However, regarding IPC: have a look at this list and think about the design effort involved. Would it have been worth it if most of the more general changes didn't each yield an IPC improvement of ~1% or more?
* Comprehensive upgrades for SSE
  - Dual 128-bit SSE dataflow
  - Up to 4 double-precision FP ops/cycle
  - Dual 128-bit loads per cycle
  - Can perform SSE MOVs in the FP "store" pipe
  - Executes two generic SSE ops + an SSE MOV each cycle (plus two 128-bit SSE loads)
  - FP scheduler can hold 36 dedicated 128-bit ops
  - SSE unaligned load-execute mode: removes the alignment requirement for SSE ld-op instructions and eliminates the awkward pairs of separate load and compute instructions, improving instruction packing and decoding efficiency

* Advanced branch prediction
  - Dedicated 512-entry indirect predictor
  - Double the return-stack size
  - More branch history bits and improved branch hashing

* 32B instruction fetch
  - Benefits integer code too
  - Fewer split-fetch instruction cases

* Sideband Stack Optimizer
  - Performs stack adjustments for PUSH/POP operations "on the side"
  - Stack adjustments don't occupy functional-unit bandwidth
  - Breaks serial dependence chains for consecutive PUSHes/POPs

* Out-of-order load execution
  - Allows load instructions to bypass other loads, and stores which are known not to alias with the load
  - Significantly mitigates L2 cache latency

* TLB optimizations
  - Support for 1G pages
  - 48-bit physical addressing
  - Larger TLBs are key for virtualized workloads and for large-footprint databases and transaction processing
  - DTLB: fully associative, 48 entries (4K, 2M, 1G), backed by L2 TLBs (512 x 4K, 128 x 2M)
  - ITLB: 16 x 2M entries

* Data-dependent divide latency

* More fastpath instructions
  - CALL and RET-Imm instructions
  - Data movement between FP and INT

* Bit-manipulation extensions: LZCNT/POPCNT

* SSE extensions: EXTRQ/INSERTQ, MOVNTSD/MOVNTSS

* Independent DRAM controllers
  - Concurrency
  - More DRAM banks reduce page conflicts
  - Longer burst length improves command efficiency

* Optimized DRAM paging
  - Increases page hits, decreases page conflicts

* History-based pattern predictor

* Re-architected NB for higher bandwidth
  - Increased buffer sizes
  - Optimized schedulers
  - Ready to support future DRAM technologies

* Write bursting
  - Minimizes Rd/Wr turnaround

* DRAM prefetcher
  - Tracks positive and negative, unit and non-unit strides
  - Dedicated buffer for prefetched data
  - Aggressively fills idle DRAM cycles

* Core prefetchers
  - DC prefetcher fills directly to the L1 cache
  - IC prefetcher more flexible: 2 outstanding requests to any address

* Shared L3
  - Victim-cache architecture maximizes efficiency of the cache hierarchy
  - Fills from L3 leave likely-shared lines in the L3
  - Sharing-aware replacement policy