Talking with Your Computer
Speech-based interfaces may soon allow computer users to retrieve data and issue instructions without lifting a finger
by Victor Zue
For decades, science-fiction writers have envisioned a world in which speech is the most commonly used interface between humans and machines. This is partly a result of our strong desire to make computers behave like human beings. But it is more than that. Speech is natural--we know how to speak before we know how to read and write. Speech is also efficient--most people can speak about five times faster than they can type and probably 10 times faster than they can write. And speech is flexible--we do not have to touch or see anything to carry on a conversation.
The first generation of speech-based interfaces is beginning to emerge, including high-performance systems that can recognize tens of thousands of words. In fact, you can now go to various computer stores and buy speech-recognition software for dictation. Products are offered by IBM, Dragon Systems, Lernout & Hauspie, and Philips. Other systems can accept extemporaneously generated speech over the telephone. AT&T Bell Labs pioneered the use of speech-recognition systems for telephone transactions, and now companies such as Nuance, Philips and SpeechWorks have also entered the field. The current technology is employed in virtual-assistant services, such as General Magic's Portico service, which allows users to request news and stock quotes and even listen to e-mail over the telephone. But the Oxygen project will need far more advanced speech-recognition systems.
I believe the next generation of speech-based interfaces will enable people to communicate with computers in much the same way that they communicate with other people. Therefore, the notion of conversation is very important. The traditional technology of speech recognition--which converts audible signals to digital symbols--must be augmented by language-understanding software so that the computer can grasp the meaning of spoken words.
On the output side, the machine must be able to verbalize; it has to take documents from the World Wide Web, find the appropriate information and turn it into well-formed sentences. Throughout this process the machine must be able to engage in a dialogue with the user so that it can clarify mistakes it might have made--for example, by asking questions such as "Did you say Boston, Massachusetts, or Austin, Texas?"
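The clarification step described above can be sketched in a few lines. This is a minimal illustration, not the actual Galaxy dialogue code; the ambiguity table and confidence threshold are invented for the example.

```python
# Sketch of a clarification dialogue: when recognition confidence is low
# and the heard word is ambiguous, ask the user a disambiguating question.
# The city list and threshold are hypothetical, not part of Galaxy.

AMBIGUOUS_CITIES = {
    "boston": ["Boston, Massachusetts", "Austin, Texas"],
}

def clarify(heard: str, confidence: float, threshold: float = 0.8) -> str:
    """Return a follow-up question when recognition is uncertain, else ''."""
    candidates = AMBIGUOUS_CITIES.get(heard.lower())
    if candidates and confidence < threshold:
        return f"Did you say {candidates[0]}, or {candidates[1]}?"
    return ""
```

A confident recognition passes through silently; an uncertain one triggers the question from the example above.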
Galaxy Speaks
We at the M.I.T. Laboratory for Computer Science have spent the past decade working on systems with this kind of conversational interface. Unfortunately, the machines developed so far are not terribly intelligent; they can deal only with limited domains of knowledge, such as weather forecasts and flight schedules. But the information is up-to-date, and you can access it over the telephone. The machines are capable of communicating in several languages; the three to which we pay the most attention are American English, Spanish and Mandarin Chinese. These systems can answer queries almost in real time--that is, just as quickly as in a normal conversation between two people--once the delays in downloading data from the Web are discounted.
The speech-based applications we have produced are founded on an architecture called Galaxy, which our group introduced five years ago. It is a distributed architecture, which means that all the computing takes place on remote servers. Galaxy can retrieve data from several different domains of knowledge to answer a user's query. The system can handle multiple users simultaneously, and last but not least, it is mobile. You can access Galaxy using only a phone, but if you also have an Internet connection, you can tell the machine to download data to your computer.
Galaxy has five main functions: speech recognition, language understanding, information retrieval, language generation and speech synthesis. When you ask Galaxy a question, a server called Summit matches your spoken words to a stored library of phonemes--the irreducible units of sound that make up words in all languages. Then Summit generates a ranked list of candidate sentences--the machine's guesses at what you actually said. To make sense of the best-guess sentence, the Galaxy system uses another server called Tina, which applies basic grammatical rules to parse the sentence into its parts: subject, verb, object and so forth. Tina then formats the question in a semantic frame, a series of commands that the system can understand. For example, if you asked, "Where is the M.I.T. Museum?" Tina would frame the question as the command "Locate the museum named M.I.T. Museum."
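A semantic frame of the kind Tina produces can be pictured as a small structured record. The sketch below is illustrative only; the field names are guesses at the idea, not Tina's actual data structure.

```python
# Sketch: parse a "Where is X?" question into a locate-style semantic
# frame, mirroring the M.I.T. Museum example. The frame's field names
# ("action", "category", "name") are hypothetical.

def parse_to_frame(sentence: str) -> dict:
    """Very simplified parse of 'Where is X?' into a locate frame."""
    prefix = "Where is "
    if sentence.startswith(prefix) and sentence.endswith("?"):
        entity = sentence[len(prefix):-1]  # strip prefix and '?'
        return {"action": "locate", "category": "place", "name": entity}
    raise ValueError("pattern not handled in this sketch")

frame = parse_to_frame("Where is the M.I.T. Museum?")
```

Real parsing in Tina is grammar-driven rather than pattern-matched, but the output plays the same role: a command the rest of the system can act on.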
At this point, Galaxy is ready to search for answers. A third server called Genesis converts the semantic frame into a query formatted for the database where the requested information lies. The system determines which database to search by analyzing the user's question. Once the information is retrieved, Tina arranges the data into a new semantic frame. Genesis then converts the frame into a sentence in the user's language: "The M.I.T. Museum is located at 265 Massachusetts Avenue in Cambridge." Finally, a commercial speech synthesizer on yet another server turns the sentence into spoken words.
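The retrieval and generation steps can be sketched end to end: a frame comes in, a database lookup retrieves the fact, a new frame is built, and a sentence is generated from it. The toy database and sentence template below are invented for illustration; they are not Genesis's actual query format or generation rules.

```python
# Sketch of the retrieve-and-generate path: semantic frame -> database
# lookup -> answer frame -> well-formed sentence. The MUSEUMS table and
# the template stand in for a real domain database and Genesis's rules.

MUSEUMS = {
    "M.I.T. Museum": "265 Massachusetts Avenue in Cambridge",
}

def answer(frame: dict) -> str:
    """Look up the requested place and generate a response sentence."""
    address = MUSEUMS[frame["name"]]                       # information retrieval
    result = {"name": frame["name"], "address": address}   # new semantic frame
    return f"The {result['name']} is located at {result['address']}."

sentence = answer({"action": "locate", "name": "M.I.T. Museum"})
```

The generated string is then what the speech synthesizer would turn into spoken words.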
Our laboratory has so far created about half a dozen Galaxy-based applications that can be accessed by telephone. Jupiter offers weather information for 500 cities worldwide. Pegasus provides the schedules of 4,000 commercial airline flights in the U.S. every day, updated every two or three minutes. Voyager is a guide to navigation and traffic in the greater Boston area. To move from one application to another, the user simply says, "I want to talk to Jupiter" or "Connect me to Voyager." Since May 1997 Jupiter has fielded more than 30,000 calls, achieving correct understanding of about 80 percent of the queries from first-time users. The calls are recorded and evaluated to improve the system's performance [see sidebar].
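Switching applications by a spoken command amounts to a simple routing decision: spot the application's name in the utterance and hand the call to that domain. A minimal sketch, assuming hypothetical handler labels rather than the real LCS interfaces:

```python
# Sketch of routing spoken switch commands such as "I want to talk to
# Jupiter". The application-to-domain mapping is a placeholder.

APPLICATIONS = {
    "jupiter": "weather",
    "pegasus": "flights",
    "voyager": "traffic",
}

def route(utterance: str) -> str:
    """Return the domain of the application named in the utterance."""
    lowered = utterance.lower()
    for name, domain in APPLICATIONS.items():
        if name in lowered:
            return domain
    return "unknown"
```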
Speech recognition would be an ideal interface for the handheld devices being developed as part of the Oxygen project. Using speech to give commands would allow much greater mobility--there would be no need to incorporate a bulky keyboard into the portable unit. And spoken language would enable users to communicate with their devices more efficiently. A traveling executive could say to his or her computer, "Let me know when Microsoft stock is above $160." The machine would act much like a human assistant, accomplishing a variety of tasks with minimum instruction.
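The stock-alert request above is a standing query: the system extracts a symbol and a threshold once, then checks incoming prices against them. The pattern matching and price check below are a hypothetical sketch of that idea, not any real assistant's implementation.

```python
# Sketch of a standing request: parse "Let me know when X stock is above
# $N." into (symbol, threshold), then test prices against it. The command
# grammar here is invented for illustration.
import re

def parse_alert(command: str):
    """Extract (symbol, threshold) from the alert command, or None."""
    m = re.search(r"when (\w+) stock is above \$(\d+)", command)
    if not m:
        return None
    return m.group(1), float(m.group(2))

def should_notify(alert, price: float) -> bool:
    """True once the observed price exceeds the requested threshold."""
    _symbol, threshold = alert
    return price > threshold
```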
Of course, several research problems still need to be addressed. We must create speech-recognition applications that can handle many complex domains of information. The systems must be able to draw data from different domains--the weather information domain, for example, and the flight information domain--without being specifically instructed to do so. We must also increase the number of languages that the machines can understand. And finally, to exploit the spoken-language interface fully, the systems must be able to do more than just what I say--they must do what I mean. Ideally, tomorrow's speech-based interfaces will allow machines to grasp their users' intentions and respond in context. Such advanced systems probably will not be available for at least a decade. But once they are perfected, they will become an integral part of the Oxygen infrastructure.
Further Reading:
Publications from the Spoken Language Systems Group at LCS
The Author
VICTOR ZUE is an associate director of the M.I.T. Laboratory for Computer Science and head of the lab's Spoken Language Systems Group. He is also a senior research scientist at M.I.T., where he received his Sc.D. in electrical engineering in 1976.