THE YEAR OF THE TIGER Introducing TigerSHARC
edtn.com
A New DSP Architecture for the Digital Convergence Infrastructure

The TigerSHARC DSP architecture is optimized for telecommunications infrastructure applications. These include large signal processing tasks with high computational requirements, such as 3G base station radio baseband processing, ADSL, and xDSL. The architecture is also well suited to Remote Access Server (RAS) modems and telecom voice coding. These applications achieve cost, power, and area synergies from multi-channel implementations in which a single chip processes multiple channels.

Static Superscalar Architecture

The TigerSHARC is a Static Superscalar architecture. It incorporates many aspects of conventional superscalar processors, including a load/store architecture, branch prediction, and a large, interlocked register file. The term "static" is applied because instruction-level parallelism is determined prior to run time and encoded in the program. All the registers are interlocked, supporting a simple programming model that is independent of implementation latencies and is fully interruptible. Branch prediction is supported via a 128-entry Branch Target Buffer (BTB) that reduces branch latency. Program code is stored in quad-word memory with no wasted space.
Key Features:

Static Superscalar architecture optimized for Telecom Infrastructure
• Eight 16-bit MACs per cycle with 40-bit accumulation
• Two 32-bit MACs per cycle with 80-bit accumulation
• Two 16-bit complex MACs per cycle
• Single-cycle Add, Compare, Select (ACS) sequence in the Viterbi algorithm
• Add-subtract instruction and bit reversal in hardware for FFTs
• 64-bit generalized bit manipulation unit

Two Billion MACs per Second at 250 MHz
• 2 billion 16-bit MACs
• 500 million 32-bit MACs
• 12 GBytes per second of internal memory bandwidth for data and instructions

Flexible Programming in Assembly and C Languages
• Support for IEEE-compatible 32-bit floating point, and 32-bit, 16-bit, and 8-bit fixed point
• Full support for signed, unsigned, fractional, and integer data types with optional saturation
• User-defined partitioning between program and data memory
• 128 general-purpose registers
• Algebraic assembly language syntax
• Supported by an optimizing C compiler
• Supported by ADI VisualDSP® Integrated Development Environment
• Single Instruction, Multiple Data (SIMD) instructions, or direct issue capability
• Predicated execution for all instructions
• Support for non-aligned accesses to memory
• Fully interruptible
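The per-second MAC rates quoted in the feature list follow from the per-cycle MAC counts and the 250 MHz clock. A minimal Python sketch of that arithmetic is below; the function and variable names are illustrative only, and the result is a peak figure that assumes every cycle issues the full number of MACs.

```python
CLOCK_HZ = 250_000_000  # 250 MHz clock quoted in the brief

def peak_macs_per_second(macs_per_cycle, clock_hz=CLOCK_HZ):
    """Peak rate, assuming every cycle issues the full number of MACs."""
    return macs_per_cycle * clock_hz

rate_16bit = peak_macs_per_second(8)  # eight 16-bit MACs per cycle
rate_32bit = peak_macs_per_second(2)  # two 32-bit MACs per cycle
print(rate_16bit, rate_32bit)
```

Sustained rates in a real program depend on how well the inner loop keeps both computation blocks busy.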
Scalable Performance

The architecture natively processes 8-, 16-, and 32-bit data types. This native support lets the processor scale the number of operations completed per cycle according to the width of the data type being processed. There are two computation blocks (CBX and CBY), each containing a multiplier, an ALU, and a 64-bit shifter. With the resources in these blocks, a single cycle executes eight 40-bit MACs on 16-bit data, two 40-bit MACs on 16-bit complex data, or two 80-bit MACs on 32-bit data. With 8-bit data types, the architecture can scale performance to issue 16 operations in a cycle, thereby executing 8 billion operations per second. In addition, the TigerSHARC is a register-based load/store architecture in which each computation block has access to a fully orthogonal 32-word register file, simplifying the programming task.
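To see why a 40-bit accumulator matters for 16-bit data, here is a plain software model of a multiply-accumulate loop. It is a sketch under stated assumptions, not TigerSHARC instruction semantics: the helper names are hypothetical, and the accumulator is modeled as wrapping on overflow (the hardware also offers optional saturation).

```python
ACC_BITS = 40  # accumulator width quoted for 16-bit MACs

def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement field of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def mac16(acc, a, b, acc_bits=ACC_BITS):
    """One MAC: a 16-bit x 16-bit product added into an acc_bits-wide accumulator."""
    assert -32768 <= a <= 32767 and -32768 <= b <= 32767
    return wrap_signed(acc + a * b, acc_bits)

def dot16(xs, ys, acc_bits=ACC_BITS):
    """Dot product of two 16-bit sequences using the modeled accumulator."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac16(acc, a, b, acc_bits)
    return acc

worst = [-32768] * 256          # each product is 2**30, the largest possible
print(dot16(worst, worst))      # fits in 40 bits
print(dot16(worst, worst, 32))  # a 32-bit accumulator wraps
```

A 16-bit by 16-bit product needs up to 32 bits, so the extra 8 bits act as headroom: 256 worst-case products still sum without overflow.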
32-Bit Benchmarks

Benchmark                         Execution Time at 250 MHz   Clock Cycles
1024-pt. Complex FFT (Radix 2)    41 us                       10,300
50-tap FIR on 1024 Input          110 us                      27,500
Single FIR MAC                    2.2 ns                      0.55

16-Bit Benchmarks

Benchmark                         Execution Time at 250 MHz   Clock Cycles
256-pt. Complex FFT (Radix 2)     4.4 us                      1,100
50-tap FIR on 1024 Input          29 us                       7,200
Single FIR MAC                    0.56 ns                     0.14
Single Complex FIR MAC            2.28 ns                     0.57
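The execution times in the benchmark tables are the cycle counts divided by the 250 MHz clock, that is, 4 ns per cycle. A short Python sketch of the conversion, with illustrative names, using cycle counts taken from the tables:

```python
CLOCK_HZ = 250e6  # 250 MHz, so one cycle lasts 4 ns

def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to wall-clock time at the given clock rate."""
    return cycles / clock_hz

benchmark_cycles = {
    "1024-pt complex FFT, 32-bit": 10_300,
    "50-tap FIR on 1024 inputs, 32-bit": 27_500,
    "256-pt complex FFT, 16-bit": 1_100,
    "50-tap FIR on 1024 inputs, 16-bit": 7_200,
}

for name, cycles in benchmark_cycles.items():
    print(f"{name}: {cycles_to_seconds(cycles) * 1e6:.1f} us")
```

The same conversion applies to the fractional per-MAC entries: 0.55 cycles is 2.2 ns.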
12 GBytes Per Second Memory Bandwidth

The TigerSHARC features a short-vector memory architecture organized in three 128-bit-wide banks. Quad (128-bit), long (64-bit), and normal (32-bit) word accesses move data from the memory banks to the register files for operations. In a given cycle, four 32-bit instruction words can be fetched, and 256 bits of data can be loaded to the register files or stored into memory. The highly efficient memory architecture can store 8-, 16-, and 32-bit data in contiguous, packed memory. Internal and external memories are organized in a unified memory map. The partition between program memory and data memory is user-determined.
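The 12 GBytes per second figure can be reproduced from the bank layout described here. The sketch below assumes one full-width access per bank per cycle at 250 MHz; that assumption and the names are mine, not taken from ADI documentation.

```python
CLOCK_HZ = 250_000_000   # 250 MHz clock
BANKS = 3                # three internal memory banks
BANK_WIDTH_BITS = 128    # each bank is 128 bits wide

def peak_bandwidth_bytes_per_second(banks=BANKS,
                                    width_bits=BANK_WIDTH_BITS,
                                    clock_hz=CLOCK_HZ):
    """Peak bandwidth, assuming one full-width access per bank per cycle."""
    return banks * (width_bits // 8) * clock_hz

instruction_bits = 4 * 32   # four 32-bit instruction words fetched per cycle
data_bits = 256             # data loaded or stored per cycle

print(peak_bandwidth_bytes_per_second())
print(instruction_bits + data_bits == BANKS * BANK_WIDTH_BITS)
```

The instruction fetch plus the data transfer add up to 384 bits per cycle, which is exactly the combined width of the three banks.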
Integer ALUs Support Data Address Generation and More

Two integer ALUs (IALUs), the JALU and the KALU, are available for addressing and pointer updates. They support circular buffering and bit reversal, and each has its own 32-word register file. More than simply data address generation units, both IALUs support general-purpose integer computations. The general-purpose nature of the IALUs benefits the compiler and increases programming flexibility.
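For readers unfamiliar with the two addressing modes mentioned here, the following Python sketch shows what bit-reversed and circular addressing compute. It illustrates the general techniques only; the function names are hypothetical and it does not model the IALU instruction set.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index, as used to reorder radix-2 FFT data."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def circular_next(pointer, step, base, length):
    """Advance a pointer by step, wrapping inside the buffer [base, base + length)."""
    return base + (pointer - base + step) % length

# Access order for an 8-point FFT (3 address bits)
print([bit_reverse(i, 3) for i in range(8)])

# Walk a 4-word delay line that starts at address 100
pointer = 100
for _ in range(6):
    print(pointer)
    pointer = circular_next(pointer, 1, base=100, length=4)
```

Doing both in the address generators means an FFT or FIR loop does not spend compute instructions on index bookkeeping.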
Four Instructions Per Cycle

The computational resources are controlled by a sequencer that can issue up to four 32-bit instructions in parallel. One or two of these instructions can control more than one computational unit, saving on code size and power. The programmer has the flexibility to issue individual instructions to each of the computation units.
The sequencer supports predicated execution, where any individual instruction executes according to the result of a previously defined condition.
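Predicated execution can be pictured as a compare that sets a condition, followed by an instruction that takes effect only if that condition holds, with no branch around it. The Python sketch below models that idea for a simple clipping loop; `select` and `clip_to_limit` are hypothetical names, and this is not TigerSHARC assembly.

```python
def select(predicate, if_true, if_false):
    """Model of a predicated move: the new value is written only when predicate is set."""
    return if_true if predicate else if_false

def clip_to_limit(samples, limit):
    """Clip each sample to an upper limit without branching around the update."""
    clipped = []
    for x in samples:
        over = x > limit            # an earlier compare defines the condition
        y = select(over, limit, x)  # predicated move keyed on that condition
        clipped.append(y)
    return clipped

print(clip_to_limit([3, 9, 5, 12], 8))
```

Because control flow does not change, short conditional updates like this avoid the branch latency that the BTB is otherwise there to hide.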
No Hardware Modes

The architecture is free of hardware modes. This eliminates wasted cycles and simplifies compiler operation. The instruction set directly supports all DSP, image, and video processing arithmetic types including signed, unsigned, fractional, and integer data types. There is optional saturation (clipping) arithmetic for all cases.
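Saturation (clipping) arithmetic pins a result at the largest or smallest representable value instead of letting it wrap around. A minimal Python sketch of the difference for 16-bit signed addition follows; the function names are illustrative and the code models the arithmetic only, not any particular instruction.

```python
def saturate(value, bits=16, signed=True):
    """Clamp value to the range of a signed or unsigned integer of `bits` bits."""
    if signed:
        low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        low, high = 0, (1 << bits) - 1
    return max(low, min(high, value))

def add_sat16(a, b):
    """16-bit signed add with saturation."""
    return saturate(a + b, 16, signed=True)

def add_wrap16(a, b):
    """16-bit signed add with two's-complement wraparound, for comparison."""
    total = (a + b) & 0xFFFF
    return total - 0x10000 if total & 0x8000 else total

print(add_sat16(30000, 10000))   # clips at the positive limit
print(add_wrap16(30000, 10000))  # wraps to a negative value
```

In audio or modem code the wrapped result would appear as a large sign flip, which is why clipping is usually the preferred failure mode.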
Look for this to make a big splash this year rah rah rah
Regards, Norden