Processing Power of Voice Recognition Technologies Requires Enhancement for Continuous Speech Recognition
Monday September 12, 4:30 am ET
LONDON, September 12 /PRNewswire/ -- Real-time speech recognition, considered unviable until recently, is now used in voice portals, but it consumes immense processing power. The computation-intensive Hidden Markov Model (HMM) technology of the mid-1980s improved the ability of voice recognition devices to identify relationships between words and ultimately led to the development of powerful speech-recognition applications. For systems to understand and respond to continuous speech, however, manufacturers must provide a large amount of processing power, and this will not be possible at reasonable cost.
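To give a feel for why HMM decoding is computation-intensive, here is a minimal sketch of Viterbi decoding over a toy two-state model. The state names (`sil`, `speech`), observation symbols and probabilities are hypothetical illustrations, not taken from any product mentioned here; real recognisers use thousands of context-dependent states and continuous acoustic features. The inner loop compares every state with every possible predecessor at every frame, so the work grows roughly as T x N^2 for T frames and N states, which is one reason continuous, large-vocabulary recognition is so demanding.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy model: all probabilities are illustrative only.
Probs = Dict[str, float]


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start: Probs,
    trans: Dict[str, Probs],
    emit: Dict[str, Probs],
) -> Tuple[List[str], float]:
    """Return the most likely state path and its log-probability."""
    log = math.log
    # best[s]: best log-probability of any path ending in state s at the current frame
    best = {s: log(start[s]) + log(emit[s][obs[0]]) for s in states}
    backptrs: List[Dict[str, str]] = []
    for symbol in obs[1:]:
        new_best: Dict[str, float] = {}
        ptr: Dict[str, str] = {}
        for s in states:  # N states ...
            # ... each compared against all N predecessors: O(N^2) work per frame
            prev = max(states, key=lambda p: best[p] + log(trans[p][s]))
            new_best[s] = best[prev] + log(trans[prev][s]) + log(emit[s][symbol])
            ptr[s] = prev
        best = new_best
        backptrs.append(ptr)
    # Trace back from the best-scoring final state
    last = max(states, key=lambda s: best[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]


states = ["sil", "speech"]
start = {"sil": 0.8, "speech": 0.2}
trans = {"sil": {"sil": 0.7, "speech": 0.3}, "speech": {"sil": 0.2, "speech": 0.8}}
emit = {"sil": {"low": 0.9, "high": 0.1}, "speech": {"low": 0.2, "high": 0.8}}
path, logp = viterbi(["low", "high", "high", "low"], states, start, trans, emit)
print(path)  # ['sil', 'speech', 'speech', 'sil']
```

Working in log probabilities avoids numerical underflow on long utterances, a standard choice in HMM decoders.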
When users speak at natural speed, it becomes difficult to associate specific sounds with particular words. Since users usually do not pause between words, processing naturally spoken phrases in real time can be tricky.
"Predominantly software-only engines demand more processing power than traditional Digital Signal Processing (DSP) boards can provide," notes VR Yoges of Frost & Sullivan (http://enterpriseapplications.frost.com). "These boards are used in Interactive Voice Response (IVR) systems, which need additional processors to supplement the IVR processing power and to support and manage the system."
Nortel's modern speech-processing platform integrates these technologies into its range of Media Processing Server (MPS) platforms. MPS systems configured with additional speech servers reduce the response time of a voice recognition solution.
The speech server is a speech-processing platform within the IVR/media processing platform, offering choice, investment protection and scalability. The advanced system software developed on this platform integrates with industry-standard components to provide the advantages of an open architecture.
"The design employs high-performance processors that plug into a separate resource subsystem integrated into the core operating architecture of the IVR/media server platform," says Yoges. "This approach provides a cost-effective and scalable resource for running advanced speech recognition and analysis."
Voice recognition systems must also allow for the different pronunciations and intonations that different people give the same word. Interpreting this speech variability has led to the development of complex pattern analysis techniques.
Apart from accents, voice recognition systems have trouble filtering out background noise, especially on calls made from mobile phones. Although better microphones have eased this problem to a small extent, wind, murmurs and music must still be properly isolated from the voice.
To address these concerns, ScanSoft introduced the OpenSpeech(TM) Recognizer (OSR), a speech recognition solution for telephony applications. A prominent feature of this solution is that it enables applications to understand a range of words and phrases without requiring highly complex grammar rules.
Innovations in automatic speech recognition (ASR), including new techniques for handling missing or unreliable data, aim to cope with noisy backgrounds while relying on models trained on clean speech. Such models can deliver substantially improved recognition. This missing-data approach to robust ASR works on the premise that, when speech is one of several sound sources, recognition is still possible using the spectral-temporal regions that remain uncorrupted.
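The missing-data idea can be sketched in a few lines. This sketch assumes each sound class is modelled by a single diagonal-covariance Gaussian trained on clean speech (a hypothetical simplification; practical systems use HMMs with Gaussian-mixture state distributions). Unreliable spectral channels are integrated out, which for a diagonal Gaussian simply means leaving them out of the log-likelihood sum. The mask marking reliable channels is assumed to be given, and the class names and values are illustrative only. Refinements such as bounded marginalisation, which also treats the noisy value as an upper bound on the speech energy, are not shown.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical model type: (means, variances) of a diagonal Gaussian per class
Model = Tuple[List[float], List[float]]


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def marginal_log_likelihood(
    frame: Sequence[float], reliable: Sequence[bool], means: Sequence[float], variances: Sequence[float]
) -> float:
    """Score a frame using only the channels marked reliable.

    With a diagonal covariance, integrating out an unreliable channel
    contributes a factor of 1, so that channel simply drops out of the sum.
    """
    return sum(
        log_gauss(x, m, v)
        for x, ok, m, v in zip(frame, reliable, means, variances)
        if ok
    )


def classify(frame: Sequence[float], reliable: Sequence[bool], models: Dict[str, Model]) -> str:
    """Pick the class whose clean-speech model best explains the reliable data."""
    return max(models, key=lambda name: marginal_log_likelihood(frame, reliable, *models[name]))


# Hypothetical clean-speech models for two sound classes
models: Dict[str, Model] = {
    "A": ([1.0, 1.0, 1.0], [1.0, 1.0, 1.0]),
    "B": ([2.0, 2.0, 6.0], [1.0, 1.0, 1.0]),
}
# A frame resembling class A, except that channel 2 is swamped by noise
noisy = [1.0, 1.2, 6.0]
print(classify(noisy, [True, True, True], models))   # B: the noisy channel misleads the full model
print(classify(noisy, [True, True, False], models))  # A: the corrupted channel is ignored
```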
Since spectral features are sensitive to gender differences, it is relatively easy to analyse the differences between what the models have learnt about male and female speech patterns. The grammar constrains the recognition hypotheses and determines whether a sequence of male or female models is used.
Researchers at the University of Sheffield discussed four system variants. They derived discrete signal-to-noise ratio (SNR) masks from estimates of the local SNR. The first ten frames, in the spectral amplitude domain, were averaged to form a stationary noise estimate, and subtracting this estimate from the noisy signal produced an estimate of the clean signal.
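The mask construction described above can be sketched as follows, assuming the input is a list of frames of spectral amplitudes and that the first ten frames contain noise only. The 7 dB threshold and the small floor that keeps the logarithm defined are hypothetical choices for illustration, not values reported by the Sheffield researchers.

```python
import math
from typing import List

Spectrogram = List[List[float]]  # frames x frequency channels, spectral amplitudes


def stationary_noise_estimate(spec: Spectrogram, n_frames: int = 10) -> List[float]:
    """Average the first n_frames frames, assumed to contain noise only."""
    head = spec[:n_frames]
    return [sum(channel) / len(head) for channel in zip(*head)]


def discrete_snr_mask(
    spec: Spectrogram, n_frames: int = 10, threshold_db: float = 7.0, floor: float = 1e-10
) -> List[List[bool]]:
    """Mark a time-frequency unit reliable when its local SNR exceeds threshold_db."""
    noise = stationary_noise_estimate(spec, n_frames)
    mask = []
    for frame in spec:
        row = []
        for y, n in zip(frame, noise):
            clean = max(y - n, floor)  # spectral subtraction, floored so the log is defined
            snr_db = 20.0 * math.log10(clean / max(n, floor))  # amplitudes, hence 20*log10
            row.append(snr_db > threshold_db)
        mask.append(row)
    return mask


# Ten noise-only frames followed by two frames containing some speech energy
spec = [[1.0, 1.0]] * 10 + [[10.0, 1.0], [2.0, 4.0]]
mask = discrete_snr_mask(spec)
print(mask[10], mask[11])  # [True, False] [False, True]
```

Because the noise estimate is taken once from the start of the signal, this approach assumes the noise is roughly stationary; non-stationary sounds such as music or murmurs would be underestimated.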
"The high threshold here offers a safety margin, reducing the impact of errors introduced by a poor fit," observes Yoges. "Soft SNR masks, in contrast, have a fuzzy interpretation, allowing more points to be let through without the damage caused by admitting points where noise outweighs speech."
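The contrast between a discrete mask with a high threshold and a soft mask can be sketched as follows. This illustration assumes a sigmoid mapping from local SNR (in dB) to a reliability value between 0 and 1; the 7 dB threshold, the sigmoid centre and the slope are all hypothetical parameters. How a recogniser combines these weights with its likelihood computation is not shown.

```python
import math
from typing import List


def hard_mask(snr_db: List[List[float]], threshold_db: float = 7.0) -> List[List[bool]]:
    """Discrete mask: a unit is reliable only if its local SNR clears a (high) threshold."""
    return [[s > threshold_db for s in row] for row in snr_db]


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_mask(
    snr_db: List[List[float]], centre_db: float = 0.0, slope: float = 0.5
) -> List[List[float]]:
    """Soft mask: map local SNR to a reliability value in (0, 1) via a sigmoid."""
    return [[_sigmoid(slope * (s - centre_db)) for s in row] for row in snr_db]


snr = [[-10.0, 0.0, 3.0, 10.0]]
print(hard_mask(snr))                            # [[False, False, False, True]]
print([round(v, 2) for v in soft_mask(snr)[0]])  # [0.01, 0.5, 0.82, 0.99]
```

In this example the unit at 3 dB is discarded outright by the hard mask but keeps most of its weight under the soft mask, while units well below the centre still contribute almost nothing.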
If you are interested in further information about the analysis of advances in voice recognition technology, please send an e-mail to Magdalena Oberland, Corporate Communications, at Magdalena.Oberland@frost.com, with the following information: your full name, company name, title, telephone number, e-mail address, city, state and country. We will send you the information via e-mail upon receipt of the above information.
Background
Frost & Sullivan, a global growth consulting company, has been partnering with clients to support the development of innovative strategies for more than 40 years. The company's industry expertise integrates growth consulting, growth partnership services and corporate management training to identify and develop opportunities. Frost & Sullivan serves an extensive clientele that includes Global 1000 companies, emerging companies, and the investment community, by providing comprehensive industry coverage that reflects a unique global perspective and combines ongoing analysis of markets, technologies, econometrics, and demographics.