Speech Technologies: From front-row disappointment to back-stage monetization
Around 25 years ago, with the popularization of Hidden Markov Models (HMMs) by Lawrence R. Rabiner as a way to stochastically model speech as a time series, the field of speech technologies started its own research and industrial revolution. Until then, the recognition of “what is spoken” was approached using time-warped pattern matching algorithms, which in most cases were too computationally demanding for the available infrastructure to allow prototypes to leave the university labs. The sudden availability of cheaper models, with steadily improving speech recognition performance, opened up many possibilities for an industry that saw many ways to monetize the results that kept coming out of research labs. Many universities opened speech labs, some research institutes were founded, and many companies opened speech R&D groups all over the globe. Of all the companies that had speech activities at that time, some of the most prominent were AT&T Bell Labs (later AT&T Research), where Rabiner was a department head at the time, and IBM Research, where Frederick Jelinek headed the speech research group.
For telcos, the development of speech recognition applications was a great opportunity to offer their customers value-added services at virtually zero extra cost, as it was a machine that would be taking these calls and, many times, resolving the customer’s issues without transferring them to a human representative. Of course, in 1985 speech recognition accuracy was not yet at a usable level, but it seemed to be improving quickly, going beyond human error levels (set at between 2% and 4% word error rate) in some trivial tasks. For example, see this figure from NIST on the evolution of speech recognition performance through evaluations performed by NIST over the years. But what seemed to be very promising progress for certain tasks up to around 1995 then flattened out well above human error levels, and has stayed there ever since. In fact, every time a new acoustic environment or content domain is introduced, the error rate skyrockets at first and then starts slowly decreasing through the retraining of acoustic and language models (and sometimes through adding more layers of complexity to the already very complex automatic speech recognition (ASR) systems). For this reason, shortly after the dot-com bubble burst in 2001, many companies and research centers started turning their backs on pure speech recognition research, arguing that if it had not become ready for mass-market production by then (due to an annoying 20-30% word error rate that persists in many applications), it would never get there. Many companies downsized or eliminated their speech groups, or converted them into “multimedia” groups, a word that has lately (and deservedly) become the next trendy term. To make things even worse, in recent years it seems to have become much more difficult for university labs and institutions in the US to get speech-only projects funded by government sponsoring agencies.
For examples of this view, see the recent article “Whatever Happened to Voice Recognition?” by Jeff Atwood or “Rest in Peas: The Unrecognized Death of Speech Recognition” by Robert Fortner.
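Since the argument above leans heavily on word error rate (WER), here is a minimal sketch of how that figure is usually computed: the word-level edit distance (substitutions, deletions and insertions) between what was said and what was recognized, divided by the number of words actually said. The function name and the example sentences are my own illustrations, and real evaluations normalize the text (case, punctuation, numbers) before scoring, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    said = "please call my doctor now"
    recognized = "please call the doctor"
    print(f"WER: {word_error_rate(said, recognized):.0%}")
```

Note that WER can exceed 100% when the recognizer inserts many extra words, which is one reason why a 20-30% figure feels much worse in practice than it sounds.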
In my opinion, all this bittersweet attitude toward speech recognition technology, and the end of continued efforts by companies to push the technology forward, responds to a clear fact: ASR will probably never allow us to interact freely with machines in the way that was envisioned 25 years ago. We simply cannot speak normally to a computer on the other side of the room, or inside a very noisy mall, without a painful experience full of repetitions and shouting. This way of interacting is what I call “front row” interaction, and it has disappointed many high-profile people, who then turned their backs on speech technology. Of course, there are many exceptions to the previous statement, mostly when some constraints can be applied to the speech recognition technology depending on the application. For example, some companies like M*Modal achieve usable ASR performance when recognizing medical doctors’ dictation of patient records by heavily adapting the system to each doctor’s speech and by limiting the vocabulary domain of the recognizer. Other companies like Yap and Vlingo achieve acceptable results on voice search and voicemail transcription by restricting the possible message domain and controlling the acoustic conditions. Even Google seems lately to be pushing hard for speech recognition by applying it to many of its applications, with variable success.
But what most people may not realize is that speech technology is not only about speech recognition. Many other areas of research have evolved thanks to speech recognition research. These have taken on parallel lives of their own and developed into many other technologies, such as speaker recognition, language recognition, emotion recognition, audio fingerprinting, music classification and structure analysis, word spotting/search, etc.
All these technologies are at the core of what I call “back-stage” speech technologies, which bring important monetization opportunities to companies and university groups. One important characteristic of back-stage speech technologies is that their accuracy does not need to be perfect for the resulting application to be useful to the end user. Some of the aforementioned technologies achieve accuracy levels similar to those obtained by speech recognition, while others are already very good, much better than a human. For example, in speaker verification a machine can determine whether a person is who they claim to be much better than a human can, even if that person tries to impersonate the original speaker. Something similar happens in language identification, where just 3 seconds of speech are enough to identify which language was spoken.
For example, back-stage algorithms can be implemented to listen to incoming speech and issue recommendations to a human agent dealing directly with the customer, or to trigger alarms in a surveillance system when some relevant information is detected. In other cases these systems can be set to run in the background in big data processing centers, trying to organize vast amounts of multimedia information that would otherwise never have been looked at by a human, as it would simply take too much effort.
It is therefore justified and acceptable that these systems do not perform their task with perfect accuracy, as their output can only help the user/company by bringing forward relevant information that they would otherwise have had to spend much more effort obtaining manually. In this context even speech recognition (the “father” of all the other technologies) becomes a valuable asset as a way to coarsely find out what is inside multimedia documents and therefore be able to, for example, classify or summarize them.
As I mentioned earlier, “multimedia” and “multimodal” are two terms that have become very trendy in recent years within funding and research circles. In back-stage processing it is very easy to combine the analysis of audio with that of video, text and any other available modalities, obtaining much more robust results. Today it is not straightforward to bring together groups working on image and audio processing, as these sometimes belong to different organizations or have simply never worked together, but this will definitely change when the benefits of doing so outweigh the difficulties.
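To give a flavor of how simple such a combination can be, here is a minimal sketch of weighted score-level (“late”) fusion, which is only one of many possible ways to combine modalities. It assumes that each modality has its own detector producing a confidence score between 0 and 1; the function name, the modality names and the weights are all hypothetical and chosen only for illustration.

```python
def fuse_scores(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-modality confidence scores.

    Modalities that produced no score (e.g. a clip without video) are
    skipped, and the remaining weights are renormalized.
    """
    available = [name for name in scores if name in weights]
    if not available:
        raise ValueError("no scored modality has a weight")
    total_weight = sum(weights[name] for name in available)
    if total_weight <= 0:
        raise ValueError("weights of the available modalities must sum to > 0")
    weighted_sum = sum(weights[name] * scores[name] for name in available)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical weights, e.g. tuned on held-out data.
    weights = {"audio": 0.5, "video": 0.3, "text": 0.2}
    # Hypothetical detector outputs for one multimedia document.
    scores = {"audio": 0.55, "video": 0.90, "text": 0.70}
    print(f"fused score: {fuse_scores(scores, weights):.2f}")
```

The point of the design is that no single detector needs to be right on its own: a mediocre audio score can be rescued by strong video and text evidence, and a missing modality simply drops out of the average.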