Monday, July 19, 2004

Telirati Newsletter #44

Five years ago I wrote this newsletter, and I came not to praise speech recognition, but to bury it. In the subsequent five years, speech recognition has lived on as a novelty in mobile phone handsets and cars. One of my colleagues recently demonstrated his new Honda's abilities. "Take me to the nearest whorehouse!" he exclaimed. And, dutifully, the navigation system displayed nearby hospitals, in anticipation, no doubt, of his catching some STD.

My prediction here was that speech technology providers would continue to bark up the wrong tree, and they have maintained that course with stubborn steadfastness. Interdisciplinary approaches are still left uninvestigated, and products are only slightly less of a laughingstock than they used to be.


Telirati Newsletter #44: The Shock of Recognition

Speech recognition is a curious beast: Sometimes it appears to have been tamed. It jumps through hoops on the trade show stage and once again we are drawn to believe. Perennially, Bill Gates makes the assertion that, really truly, we will have a "natural" interface to our computing environment. But sometimes the full horror is revealed. It twists our words into parody or simply refuses to behave. It cruelly mocks our ambition to move beyond the keyboard. For more than a dozen years, just on the PC platform, it has been "real soon now." This darker side comes into view only occasionally. One such occasion was a recent article by BT's chief technologist, Peter Cochrane. Writing in his regular column in the London Telegraph, he described his encounter with dictation software he bought because of a wrist injury. This is a singular event because this emperor of R&D has some 30 boffins working on nothing but speech recognition technology and applications. He had every reason to assert that he emerged from the dictation software fitting room fully clothed. Instead, his candid account of the state of speech recognition serves notice to the believers that they are nakedly overoptimistic:

Like a dummy, I believed what was on the box and purchased a speech-to-text program. After five hours of continuous training (of the computer, not me), this expensive package was capable of creating complete gibberish. A quick search on the Net found many alternatives, one of which was a tenth of the price. Having purchased this cheaper package, I found it equally qualified in the gibberish department. Why is it taking speech-to-text such a long time to become a practical reality? And how come some people seem to be able to get them to work... or do they?


Well, what to make of this? Dragon Systems, IBM, and Lernout & Hauspie have been making and selling dictation and voice control software for almost the entire existence of personal computers, and this category of software has been available on mainstream retail store shelves since the days of Windows 3.1. People buy this stuff. They use it. They haven't all returned it (though, if you consider the amount of shelfware on your own shelf, you might find the last part not too surprising). Is Mr. Cochrane too picky? Does he not get it? As it turns out, the problems with speech recognition software now are the same ones that existed a decade ago. Not only are the problems discouraging, but the lack of progress in solving them is a strong indictment of the lack of creativity among the purveyors of speech technologies:

With military systems, banking and telephone operations, it is now possible to embark on an adventure of human-machine voice interaction. Even cars and television sets can be reliably controlled and commanded by speech. So why can't my laptop understand what I say? Well, there seem to be a number of key problems with all generalised speech-to-text technologies. First, the acoustic environment, noise and echoes in the room play a critical part in disrupting performance. It also seems to be vital to get the microphone precisely positioned and the computer set up just so. It is also imperative to be clear in your diction and dictation, leaving adequate spaces between words, which means adopting the monotonic regularity of a robot. It also pays not to have a cold or to be thirsty. Even worse, the problems associated with our wide vocabulary and use of several words for one meaning or different contexts seem to kill developers' efforts to create a generalised environment.

Cochrane knows it can be done, in some environments, and for some applications. That makes the lack of progress in general-purpose dictation that much more frustrating. BT, as a user of recognition technologies, has figured out where to apply them, what conditions are suitable for speech recognition, and what special measures have to be taken to make recognition work as well as possible. With dictation on personal computers, almost none of this kind of application sense has been brought to bear on the problem. The underlying recognition technology gets better each year and with each faster CPU, but that only affects the margins. Cochrane's overall experience is little different from what it would have been when dictation software was a novelty.

Cochrane goes on to suggest a few things that could help: more computing power, and more attention to the context in which words are used. But this does not go nearly far enough. With a more open consideration of possible remedies, it should be possible to make dictation work significantly better without a radical increase in computing power. Consider lip reading, using an inexpensive camera. Or detecting the position of the user's head: where is the user looking? With USB, a speech recognition product could be packaged with a microphone that gives uniform and predictable results, unlike the present situation with a motley assortment of cheap-as-possible mics and sound cards. What about a mic specialized for speech recognition, worn directly against the throat, to exclude environmental noise? And these brief musings only scratch the surface of what can be done to address an obvious and longstanding problem.

Creativity, cross-functional integration, and sophistication in integrating speech technology with the computing environment are the elements of success. But, instead of addressing these, speech recognition technology providers still put almost all their resources into the core recognizer technology. The folly of this should be easy to see: the level of sophistication of recognizers varies widely, but their performance in demos is uniformly wonderful. So instead of mocking the crudeness of a competitor's recognizer, some speech recognition company out there should take this as a hint that other avenues of competition are open, ready to be exploited, and may yield differences that customers will find very easy to see and appreciate.

Copyright 1999 Zigurd Mednieks. May be reproduced and redistributed with attribution and this notice intact.