Politics : Formerly About Advanced Micro Devices


To: Cirruslvr who wrote (49783)2/16/1999 6:59:00 PM
From: Paul Engel
 
Cringe-a-Lot - Re: "Do you like posting hypocritical statements?"

I try not to, but you sure do.

Re: ". You stated that Dell, or some other Tier 1 OEM, was testing the K7 and wasn't impressed"

That's right - have you seen any OEMs announce that they ARE IMPRESSED with the K7?

I thought not - so I stand by my statement.

Re: "You also know that Jim said the K7 might be "announced" in May."

McMannis says lots of things.

McMannis said AMD was going to make $0.28 last quarter - such great Inside Information! AMD made only one-half of that!

He said the K7 was going into production in January of this year as I recall.

Re: "You know the K7 hasn't been delayed by AMD from the late June release. "

June release?

Show me an AMD announcement of a June K7 release. And don't be hypocritical.

Paul



To: Cirruslvr who wrote (49783)2/16/1999 7:14:00 PM
From: Paul Engel
 
Cringe & AMD Investors - Here is the TEXT of AMD's ISSCC K7 paper.
The graphics are not presented for obvious reasons.

Paul
{====================================}
MP 5.4 A 7th-Generation x86 Microprocessor

Steven Hesley, Victor Andrade, Bob Burd, Greg Constant, Jeffrey Correll, Matthew Crowley, Michael Golden, Nancy Hopkins, Saiful Islam, Scott Johnson, Rabbani Khondker, Dirk Meyer, Jerry Moench, Hamid Partovi, Randy Posey, Fred Weber, John Yong

Advanced Micro Devices, Austin, TX

The AMD-K7 (TM) processor is an out-of-order, three-way superscalar x86 microprocessor with a 15-stage pipeline, organized to allow 500+MHz operation. The processor can fetch, decode, and retire up to three x86 instructions per cycle to independent integer and floating-point schedulers. The schedulers can simultaneously issue up to nine operations to seven integer and three floating-point execution resources. The cache sub-system and memory interface minimize effective memory latency and provide high bandwidth data transfers to and from these execution resources. The processor contains separate instruction and data caches, each 64kB and two-way set-associative. The data cache is banked and supports concurrent access by two loads or stores, each up to 64b in length. The processor contains logic to directly control an external L2 cache. The L2 data interface is 64b wide and supports bit rates up to 2/3 the processor clock rate. The system interface consists of a separate 64b data bus.
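
A quick back-of-the-envelope check of the L2 figures quoted above (my own arithmetic, assuming a 500 MHz core clock; not a number from the paper):

# Back-of-the-envelope L2 bandwidth check (my numbers, not AMD's):
# a 64-bit L2 data interface running at up to 2/3 of the core clock.
core_clock_hz = 500e6                        # assumed "500+ MHz" core frequency
l2_clock_hz = core_clock_hz * 2 / 3          # L2 interface at 2/3 the core clock
l2_width_bytes = 64 // 8                     # 64-bit L2 data interface
peak_l2_bw = l2_clock_hz * l2_width_bytes    # bytes per second
print(f"Peak L2 data bandwidth: {peak_l2_bw / 1e9:.2f} GB/s")   # ~2.67 GB/s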

The processor uses a variation of the pulsed flip-flop as the basic latching element. In addition to its small latency (Tsu + Tcq), this topology incorporates complex logic in its first dynamic stage through the nMOS pull down network (PDN), as shown in Figure 5.4.1. To minimize hold time without significantly affecting yield, a statistical model based on local variation of devices is used to determine the smallest CLKPULSE width. The model limits overall yield fallout to 0.1% due to failure to capture data. Master-slave latches in non-critical paths eliminate hold time concerns and reduce power.
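
For readers curious what a statistical pulse-width sizing of this sort might look like, here is a minimal sketch assuming Gaussian local variation of the data-capture window; all parameter names and values are illustrative assumptions of mine, not AMD's model:

# Hedged sketch: pick the smallest clock pulse width such that the probability
# of failing to capture data stays under a yield-fallout budget (0.1% in the
# paper). The distribution and numbers below are illustrative assumptions.
from statistics import NormalDist

mean_capture_ps = 80.0     # assumed mean data-capture window (ps)
sigma_capture_ps = 12.0    # assumed local (within-die) variation (ps)
fallout_budget = 0.001     # 0.1% overall fallout limit from the paper

# The pulse must cover the capture window in all but fallout_budget of cases,
# i.e. it must reach the (1 - fallout_budget) quantile of the window.
min_pulse_ps = NormalDist(mean_capture_ps, sigma_capture_ps).inv_cdf(1 - fallout_budget)
print(f"Smallest safe CLKPULSE width: {min_pulse_ps:.1f} ps")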

The read-only memory (ROM) arrays are self-timed edge-triggered full-rail segmented structures. Each ROM array consists of one, two, or four 64b tall arrays connected by a super bit line. The full-rail segmented architecture has speed and power comparable to a reference cell and static load design, with smaller area. The segmented and twisted bit lines reduce the coupling to aggressors. Full-rail circuits eliminate the risk and complexity of matching circuits and races associated with small signals.

The sub-1ns static random access memory (SRAM) arrays are single-cycle or pipelined. The single-cycle arrays dynamically decode the address and access the array in the same cycle. The first stage of decode is incorporated in the edge-triggered, self-resetting address flops. These address flops generate monotonic outputs that drive two NAND decoders.

The pipelined arrays are required in the data cache (DC) to support three-cycle load latency. Cycle 1 is used by the load/store unit. The majority of cycle 2 is for address steering and transport, as shown in Figure 5.4.2. Because there is no clock edge available, and any self-timed signal to match the worst case address path increases the cycle time, the address is statically decoded late in cycle 2. To compensate for the increased area for latching the decoded address, the scan logic and the clock pulse generator are removed from the address flops. A test clock is added for scan controllability of these flops, and the decoder is used as the pulse generator as shown in Figure 5.4.3. The penalty for the pipelined arrays is less than 1% in area and speed.

The processor has two custom register files: the 88-entry, 90b, five read, five write, floating point register file (FPRF) and the 24-entry, 32b, nine read, eight write, combined integer future file and register file (IFFRF). Both register files avoid complex bypass circuitry by completing writes before reads occur. The FPRF decodes the write address in the previous cycle, so that the write naturally completes during read address decode. The IFFRF delays the read access until the write and tag comparisons complete. To reduce the routing and area cost, the write bitlines are single-ended. The low voltage writeability issue is solved by the three transistor configuration used for each write port, as shown in Figure 5.4.4.
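
Since the write-before-read trick above is easy to miss, here is a toy behavioral model of that ordering (a simplification of mine; only the FPRF entry count and width are taken from the paper):

# Toy model: a register file that applies all writes for a cycle before
# servicing that cycle's reads, so no bypass circuitry is needed.
class WriteBeforeReadRegFile:
    def __init__(self, entries, width_bits):
        self.width_mask = (1 << width_bits) - 1
        self.regs = [0] * entries

    def cycle(self, writes, reads):
        # writes: list of (index, value); reads: list of index
        for idx, value in writes:                        # writes complete first...
            self.regs[idx] = value & self.width_mask
        return [self.regs[idx] for idx in reads]         # ...so reads see them

rf = WriteBeforeReadRegFile(entries=88, width_bits=90)   # FPRF-like shape
print(rf.cycle(writes=[(3, 0x1234)], reads=[3]))         # -> [4660]  (0x1234)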

The phase-locked loop (PLL) operates with a 2.5V supply, internally regulated down to 1.6V to satisfy oxide voltage stress limits. A high precision bandgap circuit minimizes variation of this internal supply voltage. Given the limited voltage headroom and the high frequency target, the PLL is designed to maximize the voltage controlled oscillator (VCO) control range. To ensure minimum static phase error over the maximum VCO control voltage range, the charge pump is designed to regulate the UP current level based on the DOWN level. This avoids large current mismatches when the UP current source device begins to exit saturation. The cycle compression (less than 25ps) is optimized at the expense of accumulated phase error (less than 1 ns) by setting the loop natural frequency low.

The PLL clock is transported to the center of the chip. From the center of the chip, an eight-level binary tree distributes the clock to eight horizontal buffer slices. The final programmable drivers are connected to the metal-5 and metal-6 mesh grid in 66 columns across each buffer slice. The maximum RC simulated skew, shown in Figure 5.4.5, is 32 ps. The simulated process skew, due to channel length variation, is 96ps.

The chip is designed with full scan. One scan chain is dedicated for programming of self-timed pulses in the macros. A 13N march C algorithm is used for the DC arrays, IC arrays, and register files. At low frequency, the DC and IC bit cells are tested for data retention. For debug, the chip also supports on-the-fly frequency variation, single-cycle step operation, and stop mode. For PLL characterization, a scan chain, two high-speed pads, and one analog pad are used for extensive measurements of critical clock phase relationships and sub-block operations.
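
For anyone who has never seen a march test, here is a generic march-style RAM test sketch in the spirit of the algorithm named above (this is the common March C- element sequence, written by me for illustration; the paper's 13N march C variant may differ in its exact elements):

# Sketch of a march-style RAM test. Each "element" walks the address space in
# a fixed order, checking the expected value and then writing a new one.
def march_c_minus(ram):
    n = len(ram)
    up = range(n)
    down = range(n - 1, -1, -1)

    def element(order, expect, write):
        for addr in order:
            if expect is not None and ram[addr] != expect:
                return False                  # fault detected
            if write is not None:
                ram[addr] = write
        return True

    steps = [
        (up,   None, 0),     # any order: write 0
        (up,   0,    1),     # ascending: read 0, write 1
        (up,   1,    0),     # ascending: read 1, write 0
        (down, 0,    1),     # descending: read 0, write 1
        (down, 1,    0),     # descending: read 1, write 0
        (down, 0,    None),  # descending: read 0
    ]
    return all(element(*step) for step in steps)

print(march_c_minus([0] * 64))   # True for a fault-free simulated array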

The die is 1.84 cm2 and contains 22M transistors. Table 5.4.1 shows the technology features. C4 solder-bump flip-chip technology is used to assemble the die into a ceramic 575-pin BGA. Measurements are from initial silicon evaluation, unless otherwise stated.



To: Cirruslvr who wrote (49783)2/16/1999 7:16:00 PM
From: Paul Engel
 
Cringe & AMD Investors - Here is the TEXT of AMD's ISSCC K7 FPU paper.
The graphics are not presented for obvious reasons.

Paul

{====================================}
MP 5.5 An Out-of-Order Three-Way Superscalar Multimedia Floating-Point Unit

Alisa Scherer, Michael Golden, Norbert Juffa, Stephan Meier, Stuart Oberman, Hamid Partovi, Fred Weber

Advanced Micro Devices, Sunnyvale, CA

The AMD-K7 (TM) floating-point unit is implemented as an out-of-order coprocessor responsible for executing all x86 FPU, MMX (TM) and AMD 3DNow! (TM) instructions [1]. The FPU interfaces to the AMD-K7 core, which sends it instructions and load data and guides the retirement of instructions. The FPU sends store data and completion status back to the core. Figure 5.5.1 shows a block diagram of the FPU. The FPU contains 2.4M transistors and occupies 10.5 x 2.6 mm of die area in a 0.25um process. A micrograph of the FPU is shown in Figure 5.5.2.
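
Cross-checking these figures against the core paper posted above (my arithmetic, using only numbers from the two papers):

# Rough cross-check against the core paper's die figures (my arithmetic).
fpu_area_mm2 = 10.5 * 2.6          # ~27.3 mm2 for the FPU macro
die_area_mm2 = 1.84 * 100          # 1.84 cm2 die = 184 mm2
print(f"FPU share of die area:    {fpu_area_mm2 / die_area_mm2:.0%}")   # ~15%
print(f"FPU share of transistors: {2.4e6 / 22e6:.0%}")                  # ~11%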

FPU control consists of an in-order front-end that decodes and maps x86 instructions to internal execution ops. A central scheduler dispatches execution ops into execution pipes when their source operands are available. Pipe tracking logic reports completion status to the core. A retire queue holds all in-flight ops, maintains the register freelist, and updates architectural state as ops are retired by the core.

The front end decodes up to three x86 instructions per cycle. This involves first mapping them into three-operand format internal execution ops with the stack-relative register references converted to absolute registers. Complex x86 FPU instructions (e.g. transcendentals) are received from the core directly as a series of microcoded ops which are easily mapped into FPU execution ops. Absolute register numbers of execution ops are renamed into physical register numbers, using a mapper for the source which provides the most recent physical register mapped to a given absolute register, and obtaining destination physical register numbers from the freelist. The last stage of the in-order front-end inserts renamed execution ops into a 36-entry scheduler.
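
A highly simplified sketch of that rename step (the mapper and freelist here are toy versions of mine; only the 88-entry physical register count comes from the paper):

# Toy register renaming: map architectural (absolute) register numbers to
# physical registers via a mapper table and a freelist.
class Renamer:
    def __init__(self, num_arch, num_phys):
        self.mapper = list(range(num_arch))              # arch -> newest phys reg
        self.freelist = list(range(num_arch, num_phys))  # unassigned phys regs

    def rename(self, srcs, dest):
        phys_srcs = [self.mapper[s] for s in srcs]   # most recent mappings
        old_dest = self.mapper[dest]                 # freed when the op retires
        new_dest = self.freelist.pop(0)              # allocate from the freelist
        self.mapper[dest] = new_dest
        return phys_srcs, new_dest, old_dest

r = Renamer(num_arch=8, num_phys=88)     # 88 matches the FPRF entry count
print(r.rename(srcs=[0, 1], dest=0))     # -> ([0, 1], 8, 0)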

Ops are issued from the scheduler when their source registers are ready and the required execution resources are available. Sources may come from the register file, be bypassed directly from one of three result buses, or in the case of memory operands, come from one of two load operand buses. Each source in the scheduler only has to snoop a maximum of three buses since it is determined ahead of time whether to snoop the three result buses or the two memory operand buses. Once an op is issued, it proceeds to read the register file and then enters the appropriate execution pipe. The scheduler employs a compaction scheme that allows space to be freed in the scheduler as ops are issued to the functional units.
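
A minimal sketch of the "snoop at most three buses" wakeup idea (the bus counts follow the paper; the data structures and names are my own simplification):

# Toy wakeup logic: each waiting source is told ahead of time whether its
# value arrives on a result bus or a load-operand bus, so it snoops only that
# set of (at most three) buses per cycle.
def wakeup(sources, result_bus_tags, load_bus_tags):
    # sources: list of dicts with 'tag', 'from_memory', 'ready'
    for src in sources:
        buses = load_bus_tags if src["from_memory"] else result_bus_tags
        if not src["ready"] and src["tag"] in buses:
            src["ready"] = True
    return all(src["ready"] for src in sources)   # op may issue when all ready

op_sources = [{"tag": 42, "from_memory": False, "ready": False},
              {"tag": 7,  "from_memory": True,  "ready": True}]
print(wakeup(op_sources, result_bus_tags={42, 13, 99}, load_bus_tags={7, 8}))   # True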

On completion of an operation, the result is written to the destination register, and completion status is sent to the core, which enables the core to retire the op. The retire queue holds up to 72 speculative ops and is responsible for updating architectural state and placing the old destination registers back onto the freelist when the ops are retired.

The FPU contains three execution pipelines: add, multiply, and store. The add pipeline computes all x87 FP addition, subtraction, and compare operations, as well as MMX integer ALU operations and 3DNow! FP additions. The store pipeline processes true store operations, along with several special operations to support microcode routines.

The multiply pipeline computes all x87 FP multiplication, remainder, division, and square-root operations, MMX integer ALU and multiplication operations, and 3DNow! FP multiplications. The 76x76b multiplier employs radix-8 Booth encoding to generate 26 partial products in the first execution cycle (Figure 5.5.3). A binary tree of 4-to-2 compressors reduces partial products to two, after which a rounding constant is added through two parallel (3,2) carry-save adders. These results are carry-assimilated in the third cycle, and the result is chosen in the fourth cycle. Division and square root use a quadratically-converging multiplication-based algorithm, and they share the FP multiplier, forming the constraints on the dimensions of the multiplier. FP multiplication operations are able to fill unused cycles during a division or square root to provide maximum execution bandwidth. The latency and throughput of the various operations are shown in Table 5.5.1.
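
Two quick illustrations of the arithmetic described above (my own sketches, not AMD's implementation): the usual radix-8 Booth count gives ceil((76+1)/3) = 26 partial products, and a quadratically converging, multiplication-based division can be written as a Newton-Raphson reciprocal refinement.

import math

# 1) Radix-8 Booth recodes the multiplier in 3-bit groups, so a 76-bit
#    operand needs ceil((76 + 1) / 3) partial products.
print(math.ceil((76 + 1) / 3))   # -> 26

# 2) Multiplication-based division: refine a reciprocal estimate with
#    x <- x * (2 - b * x), then multiply by the dividend. Each iteration
#    roughly doubles the number of correct bits (quadratic convergence).
def divide(a, b, seed=0.75, steps=5):
    x = seed                       # crude reciprocal seed for b in [1, 2)
    for _ in range(steps):
        x = x * (2.0 - b * x)      # only multiplies and a subtraction
    return a * x

print(divide(1.0, 1.5))            # ~0.6666666666666666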

The floating-point register file holds 88 words, each 90b. The register file has five read ports and five write ports that operate simultaneously in a clock cycle. A pulsed write enable signal and flopped write data are driven directly to the register cells without any intervening logic. The register bit cell, Figure 5.5.4, uses pull-downs on both sides of the cross-coupled-inverter storage node to speed up the write without use of pMOS devices. This fast write permits write-through of write data to a read in the same cycle from the same register without special bypass circuitry. This contrasts with designs using single-ended writes which do not allow write-through, especially at lower voltages [2]. The register file is self-timed and self-resetting, and relies only on the falling edge of the clock to start the timing chain.

Each read port of the register file is separately enabled to reduce power dissipation when data is unchanged. Similarly, each functional unit is enabled only when performing true computation. Power is managed in the control queues by using valid bits to control conditional flip-flops, reducing power dissipation during periods without valid instructions.

The FPU uses a variation of the pulsed flip-flop as its principal latching element [3]. In addition to its small Tsu + Tcq, this topology incorporates complex logic in its first stage using a dynamic pull-down network, a feature utilized throughout the FPU to improve timing of critical paths. Figures 5.5.5 and 5.5.6 show the enabled and the 4-way mux-enabled flip-flops. The muxed flip-flop is functionally equivalent to the basic flop preceded by a multiplexer but improves critical path latency by up to 12%. Due to the dynamic nature of the first stage of the flip-flop, coupling to its inputs must be tightly controlled. To this end, a CAD tool determines the minimum allowable input signal strength based on the driver distance from the flip-flop.

Acknowledgements: The authors acknowledge the technical contributions of M. Roberts, Trull, M. Achenbach, J. Fan, M. Gulati, C. Keltcher, P. Lam, M. Siu, and J. Tseng.

References:
[1] Oberman, S., et al., "AMD 3DNow! Technology and the K6-2 Microprocessor," Proceedings of Hot Chips 10, pp. 245-254, Aug. 1998.
[2] Gieseke, B., et al., "A 600 MHz Superscalar Microprocessor with Out-of-Order Execution," ISSCC Digest of Technical Papers, pp. 176-177, Feb. 1997.
[3] Partovi, H., et al., "Flow-Through Latch and Edge-Triggered Flip-Flop Hybrid Elements," ISSCC Digest of Technical Papers, pp. 138-139, Feb. 1996.



To: Cirruslvr who wrote (49783)2/16/1999 8:15:00 PM
From: RDM
 
K7 Announcement:

As far as AMD has stated, K7 shipments will begin "late July or early August". I believe this was posted in an interview with an AMD sales representative in the United Kingdom.

I have posted the wild guesses of SI members proposing earlier dates for the K7, but AMD has not suggested any dates that I have seen other than early August.