<Execution units stall upon a load miss, so the maximum number of load misses is a direct function of the number of execution units.>
Huh? An internal instruction doesn't even reach an execution unit until the load miss is resolved. It waits in a reservation station until it gets its operands.
Take, for example, a theoretical processor with one execution unit fed by a single reservation station. Say that RS can hold up to four instructions. All four could be stuck behind independent load misses, meaning that the processor, if it were capable of it, would have up to four DRAM reads outstanding. Then, as each instruction gets its data from its load, it can be sent through the execution unit. If you had to wait until an instruction reached the execution unit before issuing its DRAM access, you'd be cutting your throughput by a factor of four.
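To put a rough number on that factor of four, here is a toy model in Python. It's only a sketch under my own assumptions: every load misses, all misses are independent, each takes the same fixed latency, and issue and execution time are ignored. The function name and the 100-cycle latency are made up for illustration, not taken from any real chip.

```python
import heapq

def cycles_to_finish(num_loads, miss_latency, max_outstanding):
    """Cycles until all load misses have returned data (toy model).

    Assumes independent misses with a fixed latency and ignores
    issue/execute time; max_outstanding is the number of DRAM reads
    the processor can have in flight at once.
    """
    in_flight = []  # min-heap of the cycles at which outstanding misses return
    now = 0
    for _ in range(num_loads):
        if len(in_flight) == max_outstanding:
            # No free slot: wait for the earliest outstanding miss to return.
            now = max(now, heapq.heappop(in_flight))
        heapq.heappush(in_flight, now + miss_latency)
    return max(in_flight, default=0)

# One miss at a time (DRAM access issued only from the execution unit)
print(cycles_to_finish(16, 100, 1))  # 1600
# Four misses in flight (one per reservation station entry)
print(cycles_to_finish(16, 100, 4))  # 400
```

With the same sixteen misses, allowing four outstanding reads finishes in a quarter of the time, which is the factor of four from the example above.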
<Texture load performance is a function of bandwidth, not latency. K7 systems should excel at this, because of the 256 bit DRAM bus. (A really good video card will provide texture memory on board.)>
Well, of course good video cards will provide texture memory on board, but you still have to copy textures into that on-board memory, which is where AGP DMA mode comes in.
On a tangent, who said that K7 systems will have or even need a 256-bit SDRAM pipe? If anything, they won't need anything more than a 128-bit PC100 SDRAM pipe. Only those new Alpha 21264 systems need a 256-bit SDRAM pipe, but that's because they're running their processor buses at 333 MHz, not 200 MHz like the K7.
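For the back-of-the-envelope arithmetic behind that, here is a quick Python sketch. The 64-bit processor bus width for both chips and the 100 MHz SDRAM clock on the Alpha side are my assumptions, not figures stated above, and these are peak numbers only.

```python
def peak_bandwidth_mb_s(width_bits, rate_mhz):
    """Peak bandwidth in MB/s for a bus of the given width and transfer rate."""
    return width_bits / 8 * rate_mhz

# Memory side (PC100 SDRAM at 100 MHz)
sdram_128 = peak_bandwidth_mb_s(128, 100)   # 1600 MB/s
sdram_256 = peak_bandwidth_mb_s(256, 100)   # 3200 MB/s

# Processor side (ASSUMPTION: 64-bit data bus on both)
k7_bus = peak_bandwidth_mb_s(64, 200)       # 1600 MB/s
alpha_bus = peak_bandwidth_mb_s(64, 333)    # 2664 MB/s

print(sdram_128 >= k7_bus)     # True: 128-bit PC100 keeps up with a 200 MHz bus
print(sdram_128 >= alpha_bus)  # False: a 333 MHz bus outruns it
print(sdram_256 >= alpha_bus)  # True: a 256-bit pipe covers it
```

Under those assumptions a 128-bit PC100 pipe exactly matches a 200 MHz processor bus, so anything wider would just sit idle, while a 333 MHz bus needs the wider pipe.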
Tenchusatsu