Pastimes : Ask Steve

To: burn2learn who wrote (704), 2/8/1997 4:53:00 PM
From: hpeace   of 4749
 
what we saw a yr ago

Abstract
Development of a full-featured, 586-class multimedia PC costing less than $1,000 requires elimination of some hardware without eliminating features or hurting performance. Cyrix Corp's 5gx86 CPU features the virtual system architecture (VSA), which is similar to virtual memory in that it provides a way to trap accesses to non-present hardware and supply the requested function in software. This technique allows the 5gx86 to overcome the performance problems associated with the three primary techniques PC makers use to lower silicon costs: eliminating L2 cache, using unified memory architecture (UMA) to decrease the amount of DRAM needed, and moving the functions of peripheral devices into software. VSA's system-management mode (SMM) allows the 5gx86 to save the machine state when a system-management interrupt is received and switch to a new context, SMM, where software can run without interfering with the operating system or applications.

A major challenge confronting the PC industry is the continuous need to add to feature sets and performance while simultaneously driving the system price ever lower. The tension between feature set and price has been present since the birth of the personal computer, but it is growing sharply as multimedia features become standard and as the PC approaches consumer price points.

Most system companies aim to field a full-featured, 586-class multimedia PC priced under $1,000. The only way to build such a machine is to eliminate hardware, but without scrimping on features or degrading performance, which is no mean trick.

In the first half of 1996, Cyrix will move the industry closer to this goal by introducing the 5gx86, a CPU built around a new technology called virtual system architecture (VSA). VSA is similar in concept to virtual memory. It provides a mechanism for trapping accesses to non-present hardware and supplying the requested function in software.

With the help of this technique, the 5gx86 overcomes the performance problems that plague the three basic strategies PC makers have devised for lowering silicon costs: eliminating L2 cache, using a unified memory architecture (UMA) to decrease the amount of DRAM needed, and moving the functions of peripheral devices into software. These three strategies all have considerable merit, but they also suffer from significant performance and compatibility problems.

Certainly many PCs, even fifth-generation systems, have shipped without L2 caches. In fact, the L2 cache is optional on almost every motherboard. Unfortunately, however, systems without an L2 cache suffer a serious performance penalty. A well-designed motherboard with high-speed extended-data-out (EDO) DRAM gets the performance penalty down to about 15 percent, but a typical system without an L2 suffers around a 25 percent performance hit.

The currently proposed UMA approaches also have performance problems. In UMA systems, the CPU and graphics contend for DRAM bandwidth, thus increasing the average time it takes the CPU to access DRAM and cutting performance by 15 to 25 percent.

Finally, moving peripheral hardware functions into new applications programming interface (API) software potentially decreases system performance by taking cycles away from application software. An equally troubling problem is that new APIs are tied to a specific operating system, ignoring backward compatibility with legacy software.

Thus, each of the three cost-reduction strategies has problems, though all show promise if these problems can be overcome.

Cacheless systems

The performance problems of systems without an L2 cache arise from the relatively long time it takes to access DRAM. A CPU in a fairly advanced PC with a 66-MHz system clock accesses DRAM in six system clocks, assuming the access hits an open DRAM page. That translates into an access time of 90 ns.

A DRAM data book, on the other hand, specifies a page-hit access time of 35 to 40 ns, less than half the time actually observed in real systems. The reason it takes so long to access DRAM in a traditional PC is that the standard system architecture inherently wastes time.
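The gap between the observed and datasheet access times can be checked with a little arithmetic. This sketch uses only the figures quoted above; the 40-ns value is the upper end of the datasheet range.

```python
# Back-of-the-envelope check of the DRAM access times quoted in the text.

SYSTEM_CLOCK_MHZ = 66          # system bus clock of the PC described above
CLOCKS_PER_ACCESS = 6          # page-hit DRAM access, as observed in real systems
DATASHEET_PAGE_HIT_NS = 40     # upper end of the 35-40 ns datasheet figure

clock_period_ns = 1000.0 / SYSTEM_CLOCK_MHZ               # ~15.15 ns per bus clock
observed_access_ns = CLOCKS_PER_ACCESS * clock_period_ns  # ~91 ns, rounded to 90 in the text

print(f"observed page-hit access:  {observed_access_ns:.0f} ns")
print(f"datasheet page-hit access: {DATASHEET_PAGE_HIT_NS} ns")
print(f"architectural overhead:    {observed_access_ns - DATASHEET_PAGE_HIT_NS:.0f} ns")
```

More than half of every page-hit access is overhead imposed by the system architecture rather than by the DRAM itself, which is the waste the next paragraphs trace cycle by cycle.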

Consider a typical read cycle: The CPU, running in a clock-multiplied mode with an internal frequency of 100 MHz plus, has to synchronize the memory-access request to the system clock. Synchronization consumes a core clock or two, unless by chance the memory access is perfectly aligned with the external clock.

Next, the access is driven on the bus pins at the beginning of a system clock cycle and sampled by the chip set at the end of that cycle. The request then flows through the chip set, where the DRAM address/control lines are driven.

Finally, the DRAM returns the read data through the chip set and back onto the system bus, where it is sampled by the CPU at the end of the next system clock cycle. Much of the time in this sequence is spent in useless delays: synchronizing to the external clock, driving the request to the chip set and waiting for the data to come back through the chip set.

An alternative that eliminates the delays and maximizes performance of cacheless systems is to integrate the memory controller into the CPU. When the CPU needs data, it drives the DRAM signals at the next available core clock edge, rather than waiting to synchronize to an external clock domain. Likewise, data returned by the DRAM is sampled at the end of a core clock.

Using a very-high-frequency clock to drive and sample DRAM lines enables memory-access timing that closely matches the theoretical performance of the DRAM. Accesses that hit open DRAM pages can achieve DRAM access times of 35 to 40 ns, getting data to the CPU in less time than the 45 ns it takes to get data out of a pipelined burst SRAM in a standard system with a 66-MHz system bus.

The second cost-reduction strategy is to trim the amount of DRAM in a PC by using a UMA, wherein system memory and the graphics frame buffer share one DRAM array. The principal problem with this scheme is the performance loss from sharing DRAM bandwidth between CPU accesses and graphics refresh.

Refreshing a 1,024 x 768 x 8 screen consumes at least 57 Mbytes/second of bandwidth, thus reducing the bandwidth available to the CPU by a significant fraction of the realizable DRAM bandwidth. From the CPU's point of view, the contention makes the DRAM appear slower than it should be, reducing performance by about 20 percent.
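The 57-Mbytes/second figure follows directly from the screen geometry. The article does not state the refresh rate, so 72 Hz is assumed below; that assumption reproduces the quoted number.

```python
# Refresh-bandwidth arithmetic for the 1,024 x 768 x 8-bit mode discussed above.
# The refresh rate is not given in the text; 72 Hz is an assumption that
# reproduces the "at least 57 Mbytes/second" figure.

WIDTH, HEIGHT = 1024, 768
BYTES_PER_PIXEL = 1            # 8 bits per pixel
REFRESH_HZ = 72                # assumed refresh rate

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL           # 786,432 bytes per frame
refresh_mb_per_s = frame_bytes * REFRESH_HZ / 1_000_000  # ~57 Mbytes/s

print(f"{refresh_mb_per_s:.1f} Mbytes/s consumed by display refresh")
```

Higher resolutions, deeper pixels, or faster refresh rates all scale this figure up linearly, which is why the contention problem grows with display quality.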

One way to eliminate the UMA performance problem is to reduce the bandwidth consumed by the display refresh by a factor of 10 or more. Such a reduction can be achieved by using advanced lossless compression hardware to create a compressed version of the frame buffer and then servicing display refresh from the compressed data.

This method requires the refresh-controller portion of a UMA system's graphics unit to contain hardware that losslessly compresses graphics data as it is read from the frame buffer during a screen refresh. If a given line of the screen can be compressed to some threshold, the compressed version of the line is written back to a separate, compressed frame buffer also stored in DRAM. Subsequent screen refreshes are sourced from the compressed data and run through a decompressor before heading to the display.

Thus, two frame buffers are maintained in a system using display compression: Applications and drivers write and read data from the uncompressed frame buffer, maintaining software compatibility; the compressed frame buffer is used solely for refreshing the display.

Whenever an application writes to a portion of the frame buffer, tags associated with the modified lines are set to indicate that the lines need to be recompressed. The newly modified lines are read from the uncompressed frame buffer during the next screen refresh and compressed as they are displayed.
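The two-buffer scheme can be sketched in software. The article does not name the compression algorithm, so run-length encoding stands in for it here, and the class and threshold are illustrative, not the 5gx86's actual design.

```python
# Toy model of the two-frame-buffer scheme described above: an uncompressed
# buffer that software writes to, a compressed shadow used for refresh, and
# per-line dirty tags. Run-length encoding stands in for the (unspecified)
# lossless compressor; a line is kept compressed only if it beats a threshold.

def rle_compress(line):
    """Run-length encode a line of pixel values as (value, count) pairs."""
    runs = []
    for pixel in line:
        if runs and runs[-1][0] == pixel:
            runs[-1] = (pixel, runs[-1][1] + 1)
        else:
            runs.append((pixel, 1))
    return runs

def rle_decompress(runs):
    return [value for value, count in runs for _ in range(count)]

class FrameBuffer:
    def __init__(self, width, height, threshold_ratio=0.5):
        self.lines = [[0] * width for _ in range(height)]  # uncompressed buffer
        self.compressed = [None] * height                  # compressed shadow
        self.dirty = [True] * height                       # tags set on writes
        self.max_runs = int(width * threshold_ratio)       # compression threshold

    def write_pixel(self, x, y, value):
        # Application and driver writes touch only the uncompressed buffer,
        # preserving software compatibility; the tag marks the line for
        # recompression at the next screen refresh.
        self.lines[y][x] = value
        self.dirty[y] = True

    def refresh(self):
        """Produce one screen's worth of lines, recompressing dirty ones."""
        screen = []
        for y, line in enumerate(self.lines):
            if self.dirty[y]:
                runs = rle_compress(line)
                # keep the compressed copy only if it is small enough
                self.compressed[y] = runs if len(runs) <= self.max_runs else None
                self.dirty[y] = False
                screen.append(list(line))       # this refresh reads the full line
            elif self.compressed[y] is not None:
                screen.append(rle_decompress(self.compressed[y]))  # cheap path
            else:
                screen.append(list(line))       # incompressible line
        return screen
```

Typical GUI content, dominated by long runs of a single color, compresses well, so after the first refresh most lines are served from the compressed shadow at a fraction of the original bandwidth.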

Reducing bandwidth

This compression mechanism reduces the bandwidth consumed by screen refresh during a typical GUI session by a factor of 12 or more, virtually eliminating the bandwidth-contention problem and thus the UMA performance problems.

Display compression is the cornerstone of good UMA performance, but an outstanding UMA system also needs advanced arbitration schemes to allocate bandwidth, extensive read and write buffering throughout the system, and integration of the memory and graphics controllers to reduce overhead in switching control of the DRAM bus from one chip to another. A system with a tightly integrated DRAM controller/CPU core and display compression offers excellent performance at a greatly reduced cost.

The third common strategy to reduce system cost is to migrate functions out of peripheral hardware and into software running on the CPU. A recent proposal from Intel Corp. promotes the concept of creating new APIs as a mechanism by which hardware functions can be implemented in software.

Moving functions from hardware into software has a long and distinguished history in the computer industry, but the specific-API approach has a few problems. Principally, these involve support for legacy devices that don't comprehend the new APIs, the performance cost inherent in the implied multilayered driver structure, and the OS-specific nature of the solution.

The bottom line: Having a new API for audio doesn't remove the need for audio hardware, unless the API can work in every operating system the consumer may use.

System designers can address these problems with VSA. This "virtualization" software operates in a privileged context, completely invisible to applications and the OS. VSA makes it possible to virtualize almost any device regardless of the OS being run.

VSA is implemented with a greatly improved system-management mode (SMM). Upon receipt of a system-management interrupt, the CPU saves the machine state and switches to a new context, SMM, where software can run without interfering with the OS or applications. SMM is triggered if software attempts to access any non-present, virtualized device, and the access is handed off to software that provides the function implied by the access.
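The trap-and-emulate flow that VSA builds on SMM can be sketched as follows. The port number, the register model, and the handler are all illustrative stand-ins, not Cyrix's actual implementation.

```python
# Sketch of the trap-and-emulate flow described above: an I/O access to a
# non-present, virtualized device raises a management interrupt, the machine
# state is saved, SMM-resident software supplies the device's behavior, and
# the saved state is restored. Port numbers and the device are illustrative.

SAVED_STATES = []   # a stack, since the improved SMM allows nested interrupts

def smm_entry(state):
    SAVED_STATES.append(dict(state))   # save machine state on the SMI

def smm_exit():
    return SAVED_STATES.pop()          # restore state and resume the OS/app

class VirtualAudioDevice:
    """Software stand-in for an absent audio chip (hypothetical registers)."""
    def __init__(self):
        self.registers = {}
    def io_write(self, port, value):
        self.registers[port] = value
    def io_read(self, port):
        return self.registers.get(port, 0)

VIRTUALIZED_PORTS = {0x220: VirtualAudioDevice()}   # 0x220: a classic audio base port

def out_instruction(state, port, value):
    """Model of an OUT to a port with no hardware behind it."""
    device = VIRTUALIZED_PORTS.get(port)
    if device is None:
        raise IOError(f"no hardware and no virtualization at port {port:#x}")
    smm_entry(state)               # trap: the SMI fires on the virtualized access
    device.io_write(port, value)   # SMM software provides the implied function
    return smm_exit()              # the OS and application never notice
```

The essential property is the last line: the interrupted context resumes exactly where it left off, which is why the scheme works under any operating system.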

Using software to replace hardware can drive costs down, but if not done carefully it can drive performance down as well. CPU core enhancements, such as new instructions and L1 cache modifications, can reduce the performance impact to an acceptable level. With a VSA-capable CPU and suitable core enhancements, an industry-standard PC audio card can be cut down to a codec and a few kbytes of code.

Cyrix's 5gx86 device integrates a VSA-enhanced 586-class CPU core with an advanced DRAM controller and graphics accelerator. The 5gx86 CPU core is optimized for running VSA software. These optimizations start with an improved SMM that has far lower entry/exit overhead and the ability to nest interrupts.

Other features reduce the execution time of the virtualization code itself. Any instruction whose operands hit in the L1 cache executes without pipeline stalls, as though its operands were in registers.

Moreover, a small portion (0 to 4 kbytes) of the 16K L1 cache can be statically or dynamically "locked down," so that the contents of the locked region will never be invalidated or evicted from the cache except under software control. Software that is aware of this feature can effectively extend the register set of the machine by storing variables in the locked-cache region.
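The effect of lock-down on eviction can be modeled with a small LRU simulation. The capacity, the addresses, and the replacement details below are illustrative; they are not the 5gx86's real cache geometry or policy.

```python
from collections import OrderedDict

# Toy model of L1 cache lock-down: an LRU cache in which lines in the locked
# region are never evicted, so data placed there behaves like extra registers.
# Capacity and addresses are illustrative, not the 5gx86's real geometry.

class LockableCache:
    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()   # address -> locked flag, kept in LRU order

    def access(self, address, lock=False):
        """Touch a cache line; returns True on a hit, False on a miss."""
        if address in self.lines:
            self.lines.move_to_end(address)      # hit: refresh LRU position
            return True
        if len(self.lines) >= self.capacity:     # miss with a full cache: evict
            for victim, locked in self.lines.items():
                if not locked:                   # locked lines are never victims
                    del self.lines[victim]
                    break
        self.lines[address] = lock
        return False

cache = LockableCache(capacity_lines=4)
cache.access(0x100, lock=True)     # pin a variable in the locked region
for addr in (0x200, 0x300, 0x400, 0x500, 0x600):
    cache.access(addr)             # streaming accesses evict unlocked lines
print(cache.access(0x100))        # True: the locked line survived the sweep
```

Without the lock, the streaming sweep would have evicted 0x100 along with everything else; with it, software gets register-like latency for pinned data at the cost of a slightly smaller effective cache.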

The locked cache can also be used to store inner loops of performance-sensitive code, CPU state information pushed during SMM interrupts and, occasionally, portions of the stack. New instructions that have been added to the core are used by 5gx86 device drivers and VSA code to accelerate graphics and video software.

One of the principal cost advantages of a 5gx86-based system is the elimination of a separate graphics controller and frame-buffer DRAM. The 5gx86 graphics pipeline is tightly coupled to a CPU core that has been enhanced to accelerate graphics operations further.

Cyrix added instructions to the CPU to perform block transfers of data within system memory, and between virtual memory and the graphics pipeline or frame buffer. Such instructions are particularly useful for rapidly displaying text or bit maps stored somewhere in virtual memory, and for manipulating blocks of compressed data.

This mechanism is very flexible. For example, the graphics pipeline can operate in concert with a block-transfer instruction to perform accelerated rendering of bit maps in virtual memory.

With the 5gx86, Cyrix is overcoming the performance problems that plague cacheless and UMA systems. VSA provides a means to eliminate legacy hardware, giving the system designer the freedom to innovate without sacrificing compatibility. Together, VSA and the 5gx86 make it possible for PC makers to produce high-performance, multimedia-enhanced 586-class computers that can be priced at under $1,000.

******************************
Building a sub-$1,000 multimedia machine.
Electronic Engineering Times: Nov 13, 1995
COPYRIGHT CMP Publications Inc. 1995
******************************