Politics : Formerly About Advanced Micro Devices

To: tejek who wrote (84502) | 12/31/1999 2:58:00 PM
From: Paul Engel
 
Tejek - Re: "why don't Intel and AMD go this route as opposed to coming up with an Athlon or a cumine?"

Intel IS DOING this!

"Intel Corp. plans nothing less than total domination of the newly emerging market for net-centric information and Internet appliances, Web-enabled set-top boxes, smart phones both wired and wireless and network-connected handheld devices. And, oh yes, it will do its best to establish a commanding presence in the embedded 32-bit processor market as well as use the architecture to maintain its position as the leading supplier of input/output processors for use in servers, routers and switches.

The results are performance numbers in the 188-to-750 Mips range (150 to 600 MHz), vs. about 100 to 233 MHz for the SA100. But the most important aspect, said Heeb, is the much lower power dissipation despite the enormous increase in performance: about 40 to 450 mW, depending on operating frequency.


Paul

{=========================================}

techweb.com

New StrongARM muscle extends Intel's reach
Bernard Cole

Intel Corp. plans nothing less than total domination of the newly emerging market for net-centric information and Internet appliances, Web-enabled set-top boxes, smart phones, both wired and wireless, and network-connected handheld devices. And, oh yes, it will do its best to establish a commanding presence in the embedded 32-bit processor market as well as use the architecture to maintain its position as the leading supplier of input/output processors for use in servers, routers and switches.

Those are the plans Intel finally revealed for the 32-bit StrongARM architecture that it gained after its acquisition of parts of Digital Equipment Corp. Details of the architecture were revealed at the Embedded Processor Forum in San Jose, Calif., earlier this month.

Intel is pulling out all the stops, applying all of the process and architectural tricks it has learned with its X86-compatible processors to make the already blazingly fast StrongARM even faster and less power-hungry, said Lawrence Pegrum, platform architect for StrongArm in the Computer Enhancement Group of the StrongArm and Bridges Division (Hudson, Mass.).

At the forum, Jay Heeb, the StrongARM design-team manager, revealed some startling power and performance numbers. If sustained in production devices (available in sample quantities toward the end of this year), they could make it difficult for most other processor vendors to compete. "As the first ARM architecture done completely by Intel, the aim was to carry forward the performance lead that the StrongARM already has without affecting the underlying instruction set and legacy code," said Pegrum. This was done by retaining complete compatibility at the instruction-set architectural level but entirely reworking the underlying hardware.

The results are performance numbers in the 188-to-750 Mips range (150 to 600 MHz), vs. about 100 to 233 MHz for the SA100. But the most important aspect, said Heeb, is the much lower power dissipation despite the enormous increase in performance: about 40 to 450 mW, depending on operating frequency. In a 150-MHz, 188-Mips device, power dissipation will be only 40 mW. In the previous architecture, at 100 MHz the SA would draw about 135 mW, which is still one of the best performance/power ratios in the industry. When sampling begins shortly after the end of the year, a 150-MHz implementation will require only 0.75 V. At 1 V, performance rises to 400 MHz, and to 600 MHz at 1.5 V.
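To put those figures in perspective, here is a back-of-the-envelope performance-per-milliwatt comparison using only the numbers quoted above; the roughly 100-Mips rating used for the SA100 at 100 MHz is an estimate (about 1 Mips/MHz), not a figure from the text.

    #include <stdio.h>

    /* Back-of-the-envelope performance/power comparison using the figures
       quoted above. The 100-Mips rating for the SA100 at 100 MHz is an
       assumption (~1 Mips/MHz); everything else comes from the article. */
    int main(void)
    {
        double new_mips = 188.0;   /* Coyanosa at 150 MHz              */
        double new_mw   = 40.0;    /* quoted dissipation at 150 MHz    */

        double old_mips = 100.0;   /* assumed for the SA100 at 100 MHz */
        double old_mw   = 135.0;   /* quoted dissipation at 100 MHz    */

        printf("new core: %.1f Mips/mW\n", new_mips / new_mw);   /* ~4.7 */
        printf("old core: %.1f Mips/mW\n", old_mips / old_mw);   /* ~0.7 */
        return 0;
    }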

"If you wanted ridiculously low power," said Pegrum, "we could push the voltage down even further and still retain a respectable frequency of about 30 MHz."

The new implementation, currently named Coyanosa, will comply with Version 5.0 of the ARM instruction-set architecture programming model, vs. Version 4.0 on the original StrongARM. However, the underlying hardware is totally different. One major change is that the new design has a seven-stage integer and an eight-stage memory pipeline, compared to five each in the earlier architecture.

In the new devices, said Heeb, the memory pipeline is identical to the integer pipeline except for the addition of a branch at the back end for the separate load and store instructions to make cache accessing more efficient. "In the previous five-stage integer-memory pipeline scheme," said Pegrum, "data for different destinations used the same data paths with no parallelism.

"Moreover, when operations destined for memory were mixed with integer operations and there was a stall in one area the other came to a halt also."

Entirely new in the Coyanosa is the use of branch prediction. The previous implementations of StrongARM had no branch prediction and made no attempt to guess when and where a branch would occur. "It had no contingencies for when bubbles occur in the pipe, stalling operations," said Pegrum. Also unlike the original linear five-stage pipeline with no reliefs, the new 7/8-stage pipeline makes extensive use of data bypassing. "When you have data that is being computed elsewhere that will be used in a subsequent operation, it was necessary on the previous StrongARM to finish operations on the data before it was available for another operation," said Heeb. Now, with bypassing, rather than going to the end of the queue and back down the pipeline, the data is recirculated back into the data flow at or near the point where it exited for use in some computation.
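For readers who have not run into it, the bypassing Heeb describes is the familiar operand-forwarding idea. A minimal sketch follows; the structures and register numbering are invented for illustration and are not Intel's design.

    #include <stdbool.h>
    #include <stdio.h>

    /* Minimal sketch of operand forwarding (bypassing) between two
       instructions in flight; the data structures are invented here. */
    typedef struct {
        int  dest_reg;      /* register this instruction will write    */
        int  src_reg;       /* register this instruction needs to read */
        int  result;        /* value produced in the execute stage     */
        bool result_ready;  /* execute stage has finished              */
    } Instr;

    /* Without bypassing, the consumer stalls until the producer's result
       is written back. With bypassing, the value is recirculated straight
       from the producer's execute stage into the consumer's operand latch. */
    bool try_forward(const Instr *producer, const Instr *consumer, int *operand)
    {
        if (producer->result_ready && producer->dest_reg == consumer->src_reg) {
            *operand = producer->result;  /* no pipeline bubble needed */
            return true;
        }
        return false;                     /* fall back to a stall      */
    }

    int main(void)
    {
        Instr add = { .dest_reg = 3, .src_reg = 1, .result = 42, .result_ready = true };
        Instr sub = { .dest_reg = 4, .src_reg = 3 };
        int operand;
        if (try_forward(&add, &sub, &operand))
            printf("forwarded r3 = %d, no stall\n", operand);
        return 0;
    }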

The new implementation also doubles the cache sizes for both data and instructions, from 16 kbytes each to 32 kbytes each. While they are still 32-way set associative in organization, the caches add write-through to the previous, write-back-only scheme. In write-through, everything written into the cache is passed out onto the bus and to main memory. In the write-back scheme, data written into the cache is held rather than being passed immediately to main memory. "The benefit of the write-back scheme is that multiple writes can be performed before actually putting them on the bus to main memory, reducing the load on the processor," said Pegrum. "But if your application involves interacting with a number of external devices or events, this actually costs you in terms of latency.

"By writing the data through the cache directly to memory, this delay is eliminated."

The original architecture was used in some network computers as a high-speed Java engine, with the Java virtual machine in the instruction cache and the Java application byte code in the data cache. That allowed execution of the byte code at 100 to 233 MHz, avoiding the use of an external 33- to 66-MHz bus. "With the much larger cache sizes, this strategy will be even more appealing. Not only can more application byte code be stuffed into the instruction cache, but a full version of the [virtual machine] could be stored in the instruction cache," said Pegrum. And with 32 kbytes of data cache, a much wider range of applications can be considered.

"Previously, because interpreted code size varied widely across applications, only a very few could use this technique-relatively simple applications such as Java terminals or network computers," said Pegrum. "Now net-centric information-appliance designs can consider this an option."

Also increased is the size of an on-chip mini-data cache, from 500 bytes to 2 kbytes. Heeb said that it acts as a sort of flow-through cache for data that is known not to be kept for long. The cache is useful in many applications, such as multimedia and networking, where the data is kept only until it is decoded and then flushed.

Improvements have also been made to the write buffer. Unlike the original design, which coalesced and sent out in one burst only adjacent entries, on the assumption that they were part of the same task or operation, the new scheme uses full coalescing, looking at the contents of the entire buffer and pulling together all related entries before bursting them out onto the bus.
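Roughly speaking, full coalescing means scanning the entire buffer for entries that fall in the same burst-sized region, not just neighboring slots. A simplified sketch, with the entry layout and the 32-byte burst size assumed for illustration:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified sketch of full write-buffer coalescing: gather every
       valid entry in the same burst-aligned region, whether or not the
       entries sit next to each other in the buffer. */
    #define BURST_BYTES 32u

    typedef struct {
        uint32_t addr;
        uint32_t data;
        int      valid;
    } WriteEntry;

    /* Collect indices of every valid entry in the same burst region as
       entries[first]; the caller then issues one burst for the group. */
    static size_t coalesce_group(const WriteEntry *entries, size_t n, size_t first,
                                 size_t *group, size_t max_group)
    {
        uint32_t base  = entries[first].addr & ~(BURST_BYTES - 1u);
        size_t   count = 0;
        for (size_t i = 0; i < n && count < max_group; i++) {
            if (entries[i].valid &&
                (entries[i].addr & ~(BURST_BYTES - 1u)) == base)
                group[count++] = i;
        }
        return count;
    }

    int main(void)
    {
        WriteEntry buf[] = {
            { 0x2000, 1, 1 }, { 0x3000, 2, 1 }, { 0x2004, 3, 1 }, { 0x2018, 4, 1 },
        };
        size_t group[8];
        size_t n = coalesce_group(buf, 4, 0, group, 8);
        printf("coalesced %zu non-adjacent entries into one burst\n", n);  /* 3 */
        return 0;
    }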

While a significant percentage of the power dissipation improvements in the Coyanosa architecture came from a shift from the original 0.25- to 0.35-micron process to 0.18-micron geometries, many improvements have been made to power management beyond the ability to operate over a 0.75- to 1.5-V range, Heeb said. In addition to the idle and sleep modes on the original architecture, the new architecture has a drowsy mode that uses leakage-suppression circuitry to allow the processor to retain the entire processor state with no more than 0.1 mW of power, 50 times less than in the idle mode.
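For reference, the quoted numbers pin down the power states fairly well; the idle figure below is only implied by the "50 times less" comparison, and the enum names are placeholders rather than Intel's terminology.

    #include <stdio.h>

    /* Figures come from the article, except idle, which is derived:
       drowsy is quoted at <= 0.1 mW and as 50 times less than idle,
       putting idle near 5 mW. */
    enum PowerMode { MODE_RUN, MODE_IDLE, MODE_DROWSY };

    static double typical_mw(enum PowerMode m)
    {
        switch (m) {
        case MODE_RUN:    return 40.0;        /* 150 MHz, low end of 40-450 mW   */
        case MODE_IDLE:   return 0.1 * 50.0;  /* implied: ~5 mW                  */
        case MODE_DROWSY: return 0.1;         /* quoted ceiling, state retained  */
        }
        return 0.0;
    }

    int main(void)
    {
        printf("run %.1f mW, idle ~%.1f mW, drowsy %.1f mW\n",
               typical_mw(MODE_RUN), typical_mw(MODE_IDLE), typical_mw(MODE_DROWSY));
        return 0;
    }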

Wakeup has also been improved. Whereas the earlier SA required several processor cycles to become fully operational, the new SA architecture returns to normal after only 30 microseconds.

"We are aiming the architecture at network computers, Internet appliances and especially handheld information appliances," said Pegrum. "But we are not ignoring the Internet infrastructure. The new architecture is a viable candidate for many places in the network: backbone, servers, routers, switches and RAID, as well as modem banks."

Copyright © 1999 CMP Media Inc.