Speech recognition tuned for interfaces
SATURDAY, NOVEMBER 06, 1999 2:00 AM - CMP Media
Nov. 05, 1999 (Electronic Engineering Times - CMP via COMTEX) -- The system-on-chip (SoC) era is creating dramatic new opportunities to implement speech recognition as an integral part of new system designs. In embedded products, speech serves as a command-and-control interface in applications ranging from automotive environments and entertainment controls to self-dialing cellular phones and even robot pet toys.
The vocabularies of systems limited to around 1,000 words, or perhaps 100 sentences, must be carefully selected to support the target application. With the addition of programmable flash memory, the system can be trained by downloading speech pattern templates generated on a PC. This new capability for embedded products can help to achieve significant reductions in system complexity. However, unlike PC-based dictation, the embedded system must implement more powerful filtering to separate the speech signal from environmental noise.
The flow of speech recognition in an embedded system consists of three stages. First, in the acoustic analysis stage, the spoken acoustic signal is converted frame by frame into a feature-space representation. The feature-space coefficients are often some representation of the speech spectrum.
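As a rough illustration of frame-by-frame acoustic analysis, the sketch below splits a sample sequence into overlapping frames and computes a few autocorrelation coefficients per frame as the feature vector. The function names, frame length, hop size, and lag count are assumptions chosen for readability; the article does not say which features or frame sizes were used.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], where r[k] = sum over n of frame[n] * frame[n + k]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def analyze(samples, frame_len=4, hop=2, max_lag=2):
    """Map a sample sequence to one feature vector per frame (illustrative sizes)."""
    return [autocorrelation(frame, max_lag)
            for frame in frame_signal(samples, frame_len, hop)]


features = analyze([1, 2, 3, 4, 5, 6])
```

A real front end would typically window each frame and derive spectral or LPC coefficients from these values; that step is omitted here.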
Second, the dynamic information between speech frames is analyzed via frame-by-frame pattern matching or sub-word analysis. Third, the final decision is made from a list of possible candidates generated in the second stage, aided by a knowledge base for the particular language.
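One common form of frame-by-frame pattern matching is dynamic time warping (DTW), which aligns an utterance against stored templates with dynamic programming. The sketch below is an assumption about what such a matcher might look like, not the method used in the evaluation; it uses scalar features and hypothetical template names to stay short.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of scalar features."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[rows][cols]


def best_match(utterance, templates):
    """Return the name of the template with the lowest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(utterance, templates[name]))


# Hypothetical templates: the first is a time-stretched copy of the utterance.
templates = {"yes": [1, 2, 2, 3], "no": [5, 5, 1]}
winner = best_match([1, 2, 3], templates)
```

The candidate list passed to the third stage could be built the same way by keeping the few lowest-cost templates instead of only the best one.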
After the two initial processing stages, which require extensive digital signal processing (DSP) capability, the final recognition of spoken words or phrases is more of a state machine function, requiring a powerful microcontroller.
Not surprisingly, the typical design of an embedded speech recognition system includes a dedicated DSP, either as a standalone chip or as a core, matched with a RISC processor to manage the embedded system and data flow. That design approach has numerous shortcomings, including dual-processor synchronization issues and the need to develop with two separate tool sets. Additionally, DSP architectures are not C-language friendly.
A speech recognition engine that seamlessly combines both RISC and DSP capabilities, such as the TriCore Unified Processor architecture developed by Infineon Technologies, should logically be capable of exceeding the performance of alternative approaches, while providing additional benefits in system design and development. In fact, Fonix Corp. (Salt Lake City), a speech recognition company, uses a TriCore-based processor as a demonstration platform.
To verify the benefits of the unified processor architecture, we evaluated system performance in three stages of the speech recognition flow. Our testing shows that, in the speech signal analysis stage, the unified processor performed an average of 1.87 times faster than a DSP, in this case an Oak core DSP. In additional tests, a C-language implementation of the pattern-matching stages outperformed an ARM-7 microcontroller by a factor of 1.11 to 1.28.
The basic test methodology was to implement typical algorithms used in speech recognition systems on a RISC+SIMD (TriCore-based) machine, and compare the results with either a dedicated DSP or a RISC implementation. Stage 1 algorithms include IIR and FIR filters; real and complex fast Fourier transforms; and autocorrelation algorithms (for LPC analysis and echo/noise cancellation). From an algorithm point of view, neural networks can be treated as a concatenation of nonlinear components and filters. Stage 2 algorithms include spectral distortion measures; dynamic programming; and the Viterbi algorithm (for decoding hidden Markov models). Stage 3 algorithms include SLR(1) parsing and tree and table search algorithms.
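Of the Stage 2 algorithms named above, the Viterbi algorithm is the least obvious from a one-line description. The following is a minimal textbook sketch over a toy two-state hidden Markov model; the state names, observation symbols, and probabilities are invented for illustration and have nothing to do with the benchmark code.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best sequence so far that ends in s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the probability of reaching s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit_p[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


# Toy model (all values are assumptions for the example).
states = ["A", "B"]
start_p = {"A": 0.5, "B": 0.5}
trans_p = {"A": {"A": 0.5, "B": 0.5}, "B": {"A": 0.5, "B": 0.5}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

prob, path = viterbi(["x", "y", "x"], states, start_p, trans_p, emit_p)
```

Production recognizers usually work with log probabilities to avoid underflow on long utterances; plain products are kept here so the recurrence is easy to follow.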
It is important to note that the DSP algorithms for the first two stages of speech recognition are mature enough to be implemented as assembly libraries. These libraries, consisting of DSP algorithms in stage 1 and mixed DSP and constrained shortest-path algorithms in stage 2, were used in our analysis. The third stage of the process combines optimization algorithms with state machine/database search algorithms. These are still evolving, and thus were implemented in C.
The performance advantages seen in this comparison arise from a number of features of the system architecture. The TriCore-based machine is a single-instruction, multiple-data (SIMD) RISC processor, with an instruction set that combines classic DSP functionality and control instructions. Infineon was the first major semiconductor company to implement this type of unified SIMD extended-RISC machine.
Register-based RISC machines are more pipeline- and compiler-friendly, due to the separation of memory access from arithmetic operations and their simplified addressing modes. Those features of RISC help reduce the overall system cost/performance ratio and time-to-market. For example, RISC machines concentrate on instruction execution time via pipeline friendliness, whereas traditional DSPs concentrate on doing more with one instruction. To encode more operands for complex instructions, DSP machines use either some type of default register scheme or a very long instruction word (VLIW) architecture, leading to higher system cost and poor compiler support.
The SIMD architecture also boosts processor performance by taking advantage of bus bandwidth and packed operations. With the same instruction size, it is possible to encode two or four times the number of operands with very small additional cost.
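The idea of packed operations can be shown with ordinary integer arithmetic: two 16-bit values share one 32-bit word, and a single "instruction" operates on both lanes at once. The helper names below are hypothetical, and the code only models the arithmetic; it is not the TriCore instruction set.

```python
MASK16 = 0xFFFF


def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def packed_add16(x, y):
    """Add two packed words lane by lane.

    Each lane wraps modulo 2**16, and no carry leaks into the neighbouring lane.
    """
    hi = ((x >> 16) + (y >> 16)) & MASK16
    lo = ((x & MASK16) + (y & MASK16)) & MASK16
    return (hi << 16) | lo


result = unpack2(packed_add16(pack2(1, 0xFFFF), pack2(2, 1)))
```

In hardware the two lane additions happen in the same cycle, which is where the bandwidth advantage described above comes from.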
While these advantages of a RISC+SIMD architecture seem to make it ideal for system-on-chip design, a number of issues had to be overcome to implement this type of architecture effectively. These issues include the challenge of switching between DSP mode and state machine mode fast enough to perform as if two independent processors were working together seamlessly.
Also, the architecture must be able to maintain one-cycle, two-data multiply-accumulate (MAC) operation, which is fundamental for the dot products at the heart of DSP algorithms. By using a 128-bit internal bus and a linked-list data structure, the TriCore Unified Processor switches processor context within two cycles.
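To see why the multiply-accumulate step matters, consider a dot product and the FIR filter built on it: every output sample is a chain of multiply-accumulate steps, each consuming two data values (a coefficient and a sample). The sketch below shows only the arithmetic, with hypothetical function names; on the hardware described here each loop iteration would correspond to a single cycle.

```python
def dot_mac(xs, ys):
    """Dot product written as a chain of multiply-accumulate steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += x * y  # one MAC: two operands fetched, one accumulate
    return acc


def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        window = [samples[n - k] if n - k >= 0 else 0 for k in range(len(taps))]
        out.append(dot_mac(taps, window))
    return out


filtered = fir_filter([1, 2, 3, 4], [1, 1])
```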
By: Howard Shi, Senior Staff Engineer, Daniel Martin, Senior Architect, Karl Westerh
Copyright 1999 CMP Media Inc.