Technology Stocks : XYBR - Xybernaut


To: Sir Auric Goldfinger who wrote (3422), 3/24/2000 6:53:00 AM
From: Wolff
 
CMU Wearable Computers for Real-Time Speech Translation. Here is way, way too much info on a seriously advanced competitor to XYBR. Bottom line is the CMU products that are commercially available blow the doors off the toy-like XYBR systems.

I posted it here because it was a slow-loading PDF link.
wolff
=========================
CMU Wearable Computers for Real-Time Speech Translation
Asim Smailagic, Dan Siewiorek, Richard Martin, Denis Reilly
Institute for Complex Engineered Systems, Carnegie Mellon University
Pittsburgh, PA 15213, USA
{asim, dps, martin+}@cs.cmu.edu, dpr@andrew.cmu.edu
Abstract
Carnegie Mellon's Wearable Computers
Laboratory has built four generations of real-time
speech translation wearable computers, culminating in
the Speech Translator Smart Module. Smart Modules are
a family of interoperable modules supporting real-time
speech recognition, language translation, and speech
synthesis. In this paper, we examine the effect of various
design factors on performance with emphasis on
modularity and scalability. A system-level approach to
power / performance optimization is described that
improved the metric of (performance / (weight * volume
* power)) by over a factor of 300 through the four
generations.
1. INTRODUCTION
The goal of CMU's Wearable Computer project
is to develop a new class of computing systems with a
small footprint that can be carried or worn by a human
and interact with computer-augmented environments. By
rapid prototyping of new artifacts and concepts, CMU
has established a new design paradigm for wearable
computers [1],[2]. Eighteen generations of wearable
computers have been designed and built over the last
seven and a half years, with most field-tested. One of the
application domains is real-time speech recognition and
language translation.
Bringing computing-intensive applications to a
wearable platform means that users have mobile access
to those applications at any time and any place. A well
designed wearable computer should make using
computing-intensive applications almost as easy and
intuitive as using a hand tool.
There are several criteria that can be of use
when designing a wearable system:
• Keep the latencies involved with running the operating system (OS) and the application low (as close to "instant response" as possible, like a flashlight).
• Make the battery life as long as possible (reduce power consumption).
• Make the interface to the software as intuitive as possible.
• Make the form factor of the device as unobtrusive as possible, specifically lightweight and operable in multiple orientations.
The Smart Module project adds two more
criteria to wearable computer design. These wearable
devices must be modular; they should be usable in
different configurations. They must also be scalable;
existing code should be easily portable to the modules.
By using a known OS, the modules have the potential to run a wide variety of applications supported by their hardware. The OS chosen was Red Hat Linux, because it is free, lightweight, scalable, and customizable, and a wide variety of applications already run on the Linux platform.
This paper will focus on the first two goals of
improving performance and reducing power
consumption. These goals seem to be inherently
contradictory at first glance: any computing device that
runs at a high clock frequency will tend to consume more
power. This paper measures how close the Smart Module
project is to achieving these goals.
The use of speech and auditory interaction on wearable computers can provide hands-free input for applications and enhance the user's attention and awareness of events and personal messages, without the distraction of stopping current actions. It also minimizes
the number of user actions required to perform given
tasks. The speech and wearable computer paradigms
came together in a series of wearable computers built by
CMU, including: Integrated Speech Activated
Application Control (ISAAC), Tactical Information
Assistant (TIA-P and TIA-0), Smart Modules, Adtranz,
and Mobile Communication and Computing Architecture
(MoCCA) [3],[4],[5].
There have been several explorations into
wearable auditory displays, such as using them to
enhance one's environment with timely information [6],
and providing a sense of peripheral awareness [7] of
people and background events. Nomadic radio has been
developed as a messaging system on a wearable audio
platform [8], allowing messages such as hourly news
broadcast or voicemail to be downloaded to the device.
Most of these prior systems have focused on speech
recognition and speech synthesis. Language translation
presents one additional challenge for wearable
computers.
2. EVOLUTIONARY METHODOLOGY
Since wearable computers represent a new
paradigm in computing, there is no consensus on the
mechanical/software human computer interface or the
capabilities of the electronics. Thus the iterative design and user evaluation made possible by our rapid design/prototyping methodology are essential for helping define this new class of computers.
The four generations of real-time speech
translation wearable computers span from general
purpose to dedicated computers: TIA-P, TIA-0, Speech
Translator Functional Prototype Smart Module, and
Optimized Speech Translator Smart Module. This
evolution was based on lessons learned from their field
tests and deployment. These four systems were
developed as two related pairs. The first member of each
pair was a functional prototype that was suitable for field
evaluation. The second member was optimized for
power consumption, size, weight, and performance. The
feedback from field tests guided the design of the next
version.
These systems had attributes which were the
same for all four as well as attributes which were varied
to achieve improved designs:
Constants
• Speech Recognition (SR) / Language Translation (LT) Application
• Cardio Processor Subsystem
Variables
• System and Software Architecture
• User Interface
2.1 SR / LT Application
The SR / LT application is a speech translation
process which consists of three phases: speech to text
language recognition, text to text language translation,
and text to speech synthesis. The application running on
TIA-P and TIA-0 is the Dragon Multilingual Interview
System (MIS). It is a keyword-triggered multilingual
playback system, which listens to a spoken phrase in
English, proceeds through a speech recognition front-end,
plays back the recognized phrase in English, and
after some delay (~8-10 secs) synthesizes the phrase in a
foreign language (Croatian). The other (local) person can answer with yes, no, or pointing gestures. The
Dragon MIS has about 45,000 active phrases, in the
following domains: medical examination, mine fields,
road checkpoints, and interrogation. Therefore, a key
characteristic of this application is that it deals with a
fixed set of phrases, and includes one-way
communication.
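In code form, the three phases amount to a simple pipeline. The sketch below is only a schematic in Python; the three function parameters stand in for the actual recognition, translation, and synthesis components (Dragon in this system, Sphinx II/PANLITE/Phonebox in the Smart Modules described later) and are not from the paper.

# Schematic of the three-phase speech translation pipeline described above.
# recognize, translate, and synthesize are placeholders for the real components.
def translate_utterance(audio, recognize, translate, synthesize):
    text_src = recognize(audio)      # phase 1: speech-to-text recognition (e.g., English)
    text_tgt = translate(text_src)   # phase 2: text-to-text language translation (e.g., Croatian)
    return synthesize(text_tgt)      # phase 3: text-to-speech synthesis in the target language

The one-way MIS effectively collapses phase 2 into a phrase-book lookup and phase 3 into prerecorded playback, while the Smart Modules run all three phases continuously in both directions.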
The Speech Translator Smart Modules
(Functional Prototype and Optimized) run a freeform,
continuous speech translation application, including two-way
communication. The modules use CMU language
translation and speech recognition software that was
profiled to identify "hotspots" for software and hardware
acceleration. TIA-P and TIA-0, as uniprocessor units, would not be appropriate for this application, so we decided to proceed with a dual-processor dedicated architecture (Smart Modules) to decrease size and response time. The first module incorporates speech to
text language recognition and text to speech synthesis.
The second module performs text to text language
translation.
2.2 Cardio Processor Subsystem
The core of all four speech translation wearable
computers is the Cardio processor card, which combines
the processor and many of the motherboard chips into
one package, about the size of a PCMCIA card [9]. The
hardware architecture of the modules is illustrated in
Figure 1.
All the necessary signals for the ISA and IDE
buses come out of the Cardio card. The Cardio also
supports two serial ports, which are used for
communication between the modules, and a VGA
interface. The ISA and IDE buses both typically operate
at 8 MHz, with a width of 16 bits. The ISA bus is limited
to 8 MB/s throughput, while the IDE interface can
achieve up to 13 MB/s throughput. Main memory is
significantly faster: although the Cardio data sheet [9]
does not have complete information on the internal
memory bus of the Cardio, a reasonable estimate is that
the 133 MHz 586-based Cardio has at least a 33 MHz
system bus with a width of 32 bits. At 33 MHz and 4 bytes per transfer, the peak rate is about 132 MB/s; even with a wait state (two bus cycles per access), the memory architecture is speculated to move 66 MB per second.

Fig. 1. Smart Module Hardware Diagram
2.3 System and Software Architecture
The main difference in the system architecture is that the TIA-P and TIA-0 speech translation application is one-way, while the Smart Modules perform two-way speech translation.
Figure 2 depicts the structure of the free-form, two-way speech translator, from English to a foreign language and vice versa. The speech is input into the
system through the Speech Recognition subsystem. A
user wears a microphone as an input device, and
background noise is eliminated using filtering
procedures. The Language Translation module includes
a language model, glossary, and machine translation
engine. The language model, generated from a variety of
audio recordings and data, provides a knowledge source
about the language properties. The Example-Based
Machine Translation (EBMT) engine translates
individual "chunks" of the sentence using the source
language model and then combines them with a model of
the target language to ensure correct syntax. The
glossary is used for any final lookups of individual words
that could not be translated by the EBMT engine. When
reading from the EBMT corpus, the system makes
several random-access reads while searching for the
appropriate phrase. In small wearable systems the
language corpus is stored on disk. Since many small random reads are performed, rather than loading large, contiguous chunks of the corpus into memory, disk latency matters more than disk bandwidth.
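The PANLITE/EBMT engine itself is not shown in the paper; the following Python sketch only illustrates the general pattern the text describes, under assumed names and a hypothetical file layout: look up sentence chunks in an example corpus stored on disk through an in-memory offset index (so each lookup is one small random-access read), and fall back to a glossary for words no example covers.

# Illustrative sketch of example-based chunk translation with a glossary fallback.
# chunk_index maps a source-language chunk to a (byte offset, length) pair in the
# corpus file, so each lookup is a seek plus a small read (latency-bound, not
# bandwidth-bound), as discussed above.
def translate_sentence(sentence, corpus_file, chunk_index, glossary):
    words = sentence.split()
    output, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):      # try the longest chunk first
            chunk = " ".join(words[i:j])
            if chunk in chunk_index:
                offset, length = chunk_index[chunk]
                corpus_file.seek(offset)        # one small random-access read per chunk
                output.append(corpus_file.read(length).decode())
                i = j
                break
        else:
            # no example covered this word: word-by-word glossary fallback
            output.append(glossary.get(words[i], words[i]))
            i += 1
    return " ".join(output)

A real engine, as the paper notes, would then combine the translated chunks with a model of the target language to ensure correct syntax; this sketch simply concatenates them.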
The Speech Synthesis subsystem performs text
to speech conversion at the output stage. To make sure
that misrecognized words are corrected, a Clarification
Dialog takes place on-screen. It includes the option to
speak the word again, or to write it. As indicated in
Figure 2, an alternative input modality could be the text
from an Optical Character Recognition subsystem (such
as scanned documents in a foreign language), which is
fed into the Language Translation subsystem. The Smart
Modules software architecture is described in section 5.
Figure 3 illustrates the one-way speech translator, based on the Multilingual Interview System
(MIS) that has been jointly developed by Dragon Systems
and the Naval Aerospace and Operational Medical
Institute (NAOMI), and runs on TIA-P and TIA-0. The
user (interviewer) selects a domain module and target
language, then selects and speaks phrases from a set of
prerecorded phrases. The speech recognition system uses
Dragon Dictate. In the next step, matching of a recognized phrase with the prerecorded phrase in a target language is performed (phrase-book lookup), and the
prerecorded phrase is played back at the output stage
(speakers). The phrases are designed to elicit brief
responses (yes or no) or gestures.
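As a toy illustration of this fixed-phrase, one-way style (not Dragon's actual implementation), the lookup amounts to matching the recognized English phrase against a phrase book and playing back a prerecorded foreign-language recording. The phrases and file names below are hypothetical examples only.

# Toy sketch of keyword-triggered phrase-book playback (not the Dragon MIS code).
# phrase_book maps a recognized English phrase to a prerecorded Croatian .WAV file.
phrase_book = {
    "do you have any pain": "croatian/do_you_have_any_pain.wav",  # hypothetical entries
    "is the road mined":    "croatian/is_the_road_mined.wav",
}

def interview_step(recognized_text, play_wav):
    phrase = recognized_text.lower().strip()
    if phrase in phrase_book:
        play_wav(f"english/{phrase.replace(' ', '_')}.wav")  # play back the recognized English phrase
        play_wav(phrase_book[phrase])                        # then the prerecorded foreign-language phrase
    else:
        print("Phrase not in the active domain module; please rephrase.")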
2.4 User Interface
User interface design went through several
iterations based on feedback received during field tests.
The emphasis was on correct two-way speech translation and an easy-to-use, straightforward interface for the clarification dialogue.
3. TIA-P AND TIA-0
The first two systems built in our family of wearable computers dedicated to speech translation applications were TIA-P and TIA-0.
3.1 TIA-P
TIA-P is a commercially available system,
developed by CMU, incorporating a 133 MHz 586
processor, 32MB DRAM, 2 GB IDE Disk, full-duplex
sound chip, and spread spectrum radio (2Mbps, 2.4 GHz)
in a ruggedized, hand-held, pen-based system designed
to support speech translation applications. TIA-P is
shown in Figure 4. TIA-P supports the Multilingual
Interview System.
Speech translation for one language (Croatian)
requires a total of 60MB disk space. The speech
recognition requires an additional 20-30 MB of disk
space.
Dragon loads into memory and stays memory resident. The translation uses uncompressed .WAV files of roughly 20 KB per phrase. There are two channels of output: the first plays in English and the second in Croatian. A
stereo signal can be split and one channel directed to an
earphone, and the second to a speaker. This is done in
hardware attached to the external speaker. An Andrea
noise-canceling microphone is used with an on-off
switch.
TIA-P has been tested with the Dragon speech
translation system in several foreign countries: Bosnia
(Figure 5), Korea, and Guantanamo Bay, Cuba. TIA-P
has also been used in human intelligence data collection
and experimentation with the use of electronic
maintenance manuals for F-16 maintenance.
Operational Experience
The following lessons were learned during the TIA-P field tests: wires should be kept to a minimum; the handheld display was convenient for checking the translated text; standard external electrical power should be available for international use; battery lifetime should be extended; and ruggedness is important. All these lessons were used as input into the design of the optimized version, TIA-0.
3.2 TIA-0
The main design goals for the TIA-0 computer
were shrinking the size, reducing the weight, and
incorporating the lessons learned from the TIA-P field
tests. TIA-0, shown in Figure 6, is a smaller form factor
system using the electronics of TIA-P. The entire system
including batteries weighs less than three pounds and
is mission-configurable for sparse or nonexistent communications infrastructures. A spread-spectrum
radio and small electronic disk drive provide
communications and storage in the case of sparse
communications infrastructure whereas a large disk drive
provides self-contained stand-alone operation when there
is no communication infrastructure. A full duplex sound
chip supports speech recognition. TIA-0 is equivalent to
a Pentium workstation in softball-sized packaging. The sophisticated housing includes an embedded joypad as an alternative input device to speech.

Fig. 4. TIA-P Wearable Computer
Fig. 5. U.S. Soldier in Balkans Using TIA-P
Fig. 6. TIA-0 Wearable Computer
4. SMART MODULE APPROACH
Smart Modules are a family of wearable computers dedicated to speech processing applications. A
smart module provides a service almost instantaneously
and is configurable for different applications. The design
goals also included: reduce latency, eliminate memory
context swaps, and minimize weight, volume, and power
consumption. The functional prototype consists of two
specialized modules, performing language translation
and speech recognition. The speech recognition module
uses CMU's Sphinx II continuous, speaker-independent system [10],[11]. The speech recognition code was profiled and tuned. Profiling identified "hot spots" for
hardware and software acceleration and places to reduce
computational and storage resources. Input to the module
is audio and output is ASCII text. The speech recognition
module also supports text to speech synthesis. Figure 7
illustrates a combination of the language translation
module (LT), and speech recognizer (SR) module,
forming a complete stand-alone audio-based interactive
dialogue system for speech translation. As a result of the profiling, we achieved a five times smaller memory requirement than the desktop software version.
The LT module runs the PANLITE language
translation software [12], and the SR module runs
CMU's Sphinx II Speech Recognition Software and Phonebox Speech Synthesis software. Target languages included Serbo-Croatian, Korean, Creole French, and
Arabic. Average language translation performance was
one second per sentence.
5. SMART MODULE ARCHITECTURE
The Smart Module system has two distinct
kinds of processes: the Server-Application Group and the
System Controller. A Server-Application Group consists
of a UNIX background process which communicates
with an application, such as PANLITE, via Inter-Process
Communication within a module. The server process also
communicates with the System Controller over the
TCP/IP Network. The System Controller keeps track of
what servers are present on which modules, and
coordinates the flow of information between the servers.
It is possible to interface any number of applications with
one server process. This architecture makes it easy to add
new modules to the system. Figure 8 illustrates how data
flows between the System Controller and a Server-Application
Group. The System Controller operates on a
Newton MessagePad 2000 to give the user a chance to
correct misrecognized words. The Newton is primarily
used as a display device.
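The paper does not give the server code, so the following is only a minimal Python sketch of the Server-Application Group pattern it describes: a background process accepts a TCP connection from the System Controller, relays each request to a resident local application over a pipe (standing in for the IPC to PANLITE), and returns the result. The port number, line-oriented framing, and the translate_app program are assumptions, not details from the paper.

# Minimal sketch of a server process bridging the System Controller (TCP/IP)
# and a local application (IPC via a pipe). Error handling is omitted.
import socket, subprocess

HOST, PORT = "0.0.0.0", 5000  # hypothetical port for the System Controller connection

# Launch the local application once and keep it resident, since the paper notes
# that an application's working set should stay memory resident to minimize latency.
app = subprocess.Popen(["./translate_app"], stdin=subprocess.PIPE,
                       stdout=subprocess.PIPE, text=True, bufsize=1)

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, _ = srv.accept()                 # connection from the System Controller
    with conn, conn.makefile("rw") as f:
        for request in f:                  # one request (e.g., recognized text) per line
            app.stdin.write(request)       # hand it to the application over the pipe
            reply = app.stdout.readline()  # e.g., the translated sentence
            f.write(reply)                 # return the result to the System Controller
            f.flush()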
The key factors that determine how many
processes can be run on a module are memory, storage
space, and available CPU cycles. To minimize latency,
the entirety of an application's working dataset should be
memory resident.
The intermodule communications infrastructure
is a TCP/IP based network running over serial PPP links,
as detailed in Figure 9 [13]. TCP/IP can be built directly
into the Linux kernel, eliminating the need to deal with
the network in the Server software. It also supports
packet forwarding directly in the kernel. Finally, it can
be utilized over a variety of communications media,
supporting several wired as well as wireless connections.
It is even possible for the system to communicate with any TCP/IP-based intranet or the Internet, if a module is configured as a gateway with a connection to an outside network.

Fig. 7. Speech Recognizer (SR) and Language Translator (LT) Smart Module
Fig. 8. Flow of Data in Smart Module Software
Fig. 9. Serial PPP Communication
The position of each module in the physical
network does not matter; the System Controller simply
sends out all communications for all modules over the
same link, creating a virtual network as shown in
Figure 10. The modules themselves handle routing. New modules added to the system can modify each other's routing tables automatically.
Currently, because all of the modules used are physically
connected with each other, the Linux PPP server
automatically configures the routing tables of the
modules. But if more modules are added to the system, a
dynamic routing protocol must be used to modify the
tables of a module that may not be physically connected
to the module that is added.
The secondary storage drives are of Type II and
Type III PCMCIA form factor, but these drives also
support an IDE interface. The PCMCIA socket that is on
the Smart Modules is wired directly into the IDE bus,
and there is no PCMCIA controller in the hardware
design. While this precludes the use of anything other
than hard disks in the PCMCIA slots, it saves space in
the overall design.
Figure 11 depicts the functional prototype of the
Speech Translator Smart Module, with one module
performing language translation, and another one speech
recognition and synthesis. The optimized version of the
Speech Translator Smart Module is shown in Figure 12.
Operational Experience
The lessons learned from tests and
demonstrations include: the manual intervention process
to correct misrecognized words incurs some delay;
swapping can diminish the performance of the language
translation module; and the display can be as small as a deck of cards.
The required system resources for speech translator software are shown in Table 1. We achieved a six times speedup over the original desktop PC system implementation of language translation, and five times smaller memory requirements.

Table 1. Comparison of Required System Resources

              Laptop / Workstation   Functional Module SR/LT   Optimized Module SR/LT
Memory Size   195 MB                 53 MB                     41 MB
Disk Space    1 GB                   350 MB                    200 MB

Fig. 10. The Smart Module Virtual Network
Fig. 11. Speech Translator SM Functional Prototype
Fig. 12. Optimized Speech Translator SM

6. PERFORMANCE EVALUATION

Figure 13 illustrates the response time for speech recognition applications running on TIA-P, TIA-0,
and SR Smart Module. Because the SR module uses a lightweight operating system (Linux) rather than the Windows 95 used on TIA-P and TIA-0, and its speech recognition code is more customized, it has a shorter response time. An efficient
mapping of the speech recognition application onto the
SR Smart Module architecture provided a response time
very close to real-time.
The performance of the family of Speech
Translation modules is summarized in Figure 14. The
metric for comparison in Figure 14 is proportional to the
processing power (SpecInt), representing performance,
and inversely proportional to the product of volume,
weight, and power consumption (R), representing
physical attributes. Figure 14 shows the normalized
performance scaled by volume, weight, and power
consumption. The diagram was constructed based on the
data shown in Table 2. A TI 6030 laptop is taken as a
baseline for comparison, and its associated value is one.
TIA-0 is a factor of 44 better than the laptop while SR
Smart Module is over 355 times better than the laptop
(i.e., at least a factor of five better in each dimension).
Therefore there are orders of magnitude improvement in
performance as we proceed from more general purpose to
more special purpose wearable computers.
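As a quick illustration of the metric, the Python sketch below recomputes the normalized SpecInt/R values from the data in Table 2, with the TI 6030 laptop as the baseline. The variable and function names are ours; the figures are taken directly from the table, and the results agree with its Normalized column to within the rounding of the published R values.

# Recompute the metric SpecInt / (Volume * Weight * Power), normalized to the
# TI 6030 laptop baseline (value = 1.0), from the data in Table 2.
import math

systems = {
    # name: (SpecInt, volume in cubic inches, weight in lbs, power in watts)
    "TI 6030": (175.00, 260.00, 7.50, 36.00),
    "TIA-P":   (55.00,  88.00,  3.00, 6.50),
    "TIA-0":   (55.00,  45.00,  2.50, 4.50),
    "SR-SM":   (175.00, 45.00,  2.13, 4.00),
    "OPT-SM":  (175.00, 33.00,  1.50, 4.00),
}

def metric(spec_int, volume, weight, power):
    return spec_int / (volume * weight * power)

baseline = metric(*systems["TI 6030"])
for name, params in systems.items():
    normalized = metric(*params) / baseline
    print(f"{name:8s} normalized = {normalized:7.2f}  log10 = {math.log10(normalized):5.2f}")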
7. CONCLUSIONS
Four generations of CMU wearable computers
have been built for real-time speech translation
applications, culminating in the Speech Translator Smart
Module. Our results show that there are orders of
magnitude improvement in performance as we proceed
from one generation of Wearable Computers performing
speech recognition to the next one. To our knowledge,
Speech Translator Smart Modules are the only wearable
computers capable of performing two-way speech
translation (involving speech recognition and language
translation).
A system-level approach to power /
performance optimization improved the metric of
(performance / (weight * volume * power)) by over a
factor of 300 through the four generations.
8. ACKNOWLEDGMENT
This work was supported by the Defense Advanced Research Projects Agency under Contract # DABT63-95-C-0026 and by the Institute for Complex Engineered Systems at Carnegie Mellon University.
9. REFERENCES
[1] A. Smailagic and D. P. Siewiorek, "The CMU Mobile Computers: A New Generation of Computer Systems," Proceedings of the IEEE COMPCON 94, IEEE Computer Society Press, February 1994, pp. 467-473.
[2] D.P. Siewiorek, A. Smailagic, and J.C. Lee, "An Interdisciplinary Concurrent Design Methodology as Applied to the Navigator Wearable Computer System," Journal of Computer and Software Engineering, Vol. 2, No. 2, 1994, pp. 259-292.
[3] A. Smailagic, "ISAAC: A Voice Activated Speech Response System for Wearable Computers," Proceedings of the IEEE International Conference on Wearable Computers, Cambridge, MA, October 1997.
[4] D. Reilly, "Power Consumption and Performance of
a Wearable Computing System," Masters Thesis,
Carnegie Mellon University, Electrical and
Computer Engineering Department, 1998.
Table 2. Performance Values Measured and Calculated for Wearable Computers

Name      SpecInt   Volume (in³)   Weight (lbs)   Power (watts)   R (V*W*P)   Normalized SpecInt/R   Log of Normalized
TI 6030   175.00    260.00         7.50           36.00           70200.00    1.00                   0.00
TIA-P     55.00     88.00          3.00           6.50            1716.00     12.86                  1.11
TIA-0     55.00     45.00          2.50           4.50            506.25      43.58                  1.64
SR-SM     175.00    45.00          2.13           4.00            382.50      183.53                 2.26
OPT-SM    175.00    33.00          1.50           4.00            198.00      354.55                 2.55

Fig. 13. Response Time Comparison (response times of SR-SM, TIA-P, and TIA-0)
[5] D.P. Siewiorek, A. Smailagic, L. Bass, J. Siegel, R. Martin, and B. Bennington, "Adtranz: A Mobile Computing System for Maintenance and Collaboration," Proceedings of the 2nd IEEE International Conference on Wearable Computers, Pittsburgh, PA, 1998.
[6] B. Bederson, "Audio Augmented Reality: A Prototype Automated Tour Guide," Proc. of CHI '95, May 1995, pp. 210-211.
[7] E.D. Mynatt, M. Back, R. Want, and R. Frederick, "Audio Aura: Light-Weight Audio Augmented Reality," Proceedings of UIST '97 User Interface Software and Technology Symposium, Banff, Canada, October 15-17, 1997.
[8] N. Sawhney and C. Schmandt, "Design of Spatialized Audio in Nomadic Environments," Proceedings of the International Conference on Auditory Display, November 2-5, 1997, Palo Alto, CA.
[9] Epson Corporation, Epson CARDIO 486-D4 Data
Sheet, 1997.
[10] M. Ravishankar, "Efficient Algorithms for Speech Recognition," Ph.D. Thesis, Carnegie Mellon University, Tech. Report CMU-CS-96-143, May 1996.
[11] K.F. Lee, H.W. Hon, M.J. Hwang, and R. Reddy, "The Sphinx Speech Recognition System," Proc. IEEE ICASSP, Glasgow, UK, May 1989.
[12] R.E. Frederking and R. Brown, "The Pangloss-Lite Machine Translation System," Expanding MT Horizons: Proceedings of the Second Conference of the Association for Machine Translation in the Americas, 1996, pp. 268-272.
[13] J. Dorsey, "Smart Module Networking," Personal
Communication, 1998.
Fig. 14. Composite Performance of Speech Recognition Wearable Computers