Latest Publications

Speech Technologies: From front-row disappointment to back-stage monetization

Around 25 years ago, when Lawrence R. Rabiner and others popularized Hidden Markov Models (HMMs) as a way to model speech stochastically as a time series, the field of speech technologies began its own research and industrial revolution. Until then, recognizing “what is spoken” was approached with time-warped pattern matching algorithms, which in most cases were too computationally demanding for the available infrastructure to let prototypes leave the university labs. The sudden availability of cheaper models with steadily improving recognition performance opened many possibilities for industry, which saw many ways to monetize the results that kept coming out of research labs. Many universities opened speech labs, some research institutes were founded, and many companies opened speech R&D groups all over the globe. Among the companies with speech activities at that time, two of the most prominent were AT&T Bell Labs (later AT&T Research), where Rabiner was a department head, and IBM Research, where Frederick Jelinek headed the speech research group.

For telcos, the development of speech recognition applications was a great opportunity to offer their customers value-added services at virtually zero extra cost, since a machine would take the calls and, many times, resolve the customer’s issue without transferring them to a human representative. Of course, in 1985 speech recognition accuracy was not yet at a usable level, but it seemed to improve quickly, even beating human error levels (between 2% and 4% word error rate) in some trivial tasks. For example, see this figure from NIST on the evolution of speech recognition performance across the evaluations NIST has run over the years. But the very promising results obtained on certain tasks up to around 1995 then flattened out well above human error levels, and they have stayed there ever since. In fact, every time a new acoustic environment or content domain is introduced, the error rate skyrockets at first and then slowly decreases as acoustic and language models are retrained (and sometimes as more layers of complexity are added to the already very complex ASR systems). For this reason, shortly after the bubble burst in 2001, many companies and research centers started turning their backs on pure speech recognition research, arguing that if it had not become ready for mass-market production by then (because of an annoying 20-30% word error rate in many applications), it would never get there. Many companies downsized or eliminated their speech groups, or converted them into “multimedia” groups, which has lately (and deservedly) become the next trendy word. To make things even worse, in recent years it seems to have become much harder for university labs and institutions in the US to get speech-only projects funded by government sponsoring agencies.

To illustrate this, see the recent articles “Whatever Happened to Voice Recognition?” by Jeff Atwood and “Rest in Peas: The Unrecognized Death of Speech Recognition” by Robert Fortner.
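Since the word error rate (WER) figures quoted above carry most of the argument, here is a minimal sketch of how such a number is usually computed: the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name and the whitespace tokenization are my own assumptions for illustration; official scoring tools also normalize the text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation normalization).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word ("the") and one substituted word ("off" -> "of")
    # out of four reference words.
    print(word_error_rate("turn the lights off", "turn lights of"))
```

Note that a WER can exceed 100% when the recognizer inserts many extra words, which is one reason why it reads as an error rate and not as an accuracy.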

In my opinion, all this bittersweet attitude toward speech recognition technology, and the end of sustained efforts by companies to push the technology forward, respond to a clear fact: ASR will probably never allow us to interact freely with machines in the way that was envisioned 25 years ago. We simply cannot speak normally to a computer on the other side of the room, or inside a very noisy mall, without a painful experience full of repetitions and shouting. This way of interacting is what I call “front row” interaction, and it has disappointed many high-profile people, who turned their backs on speech technology. Of course, there are many exceptions to the previous statement, mostly when the application allows some constraints to be applied to the speech recognition technology. For example, companies like M*Modal achieve usable ASR performance when recognizing medical doctors’ dictation of patient records by heavily adapting the system to each doctor’s speech and by limiting the vocabulary domain of the recognizer. Other companies, like Yap and Vlingo, achieve acceptable results on voice search and voicemail transcription by restricting the possible message domain and controlling the acoustic conditions. Even Google seems to be pushing hard for speech recognition lately, applying it to many of its applications with varying success.

But what most people may not realize is that speech technology is not only about speech recognition. Many other areas of research have evolved thanks to speech recognition research. These have taken on parallel lives of their own and led to many other technologies, such as speaker recognition, language recognition, emotion recognition, audio fingerprinting, music classification and structure analysis, word spotting/search, etc.

All these technologies are at the core of what I call “back-stage” speech technologies, which bring important monetization opportunities to companies and university groups. One important characteristic of back-stage speech technologies is that the resulting accuracy does not need to be perfect for the application to be useful to the end user. Some of the aforementioned technologies achieve accuracy levels similar to those obtained by speech recognition, while others are already very good, much better than a human. For example, in speaker verification a machine can tell whether a person is who they claim to be much better than a human can, even if that person tries to impersonate the original speaker. Something similar happens in language identification, where just 3 seconds of speech are enough to identify which language was spoken.

For example, back-stage algorithms can be set up to listen to incoming speech and issue recommendations to a human agent dealing directly with the customer, or to trigger alarms in a surveillance system when some relevant information is detected. In other cases these systems can run in the background in big data processing centers, trying to organize vast amounts of multimedia information that would otherwise never be looked at by a human because it would simply take too much effort.

It is therefore justified and acceptable that these systems do not perform their task with perfect accuracy, as their output can only help the user/company by bringing forward some relevant information they would have had to spend much more effort to obtain manually. In this context even speech recognition (the “father” of all other technologies) becomes a valuable asset as a way to coarsely find what is inside multimedia documents, and therefore be able to, for example, classify or summarize them.

As I mentioned earlier, “multimedia” and “multimodal” are two terms that have become very trendy in recent years within funding and research circles. In back-stage processing it is very easy to combine the analysis of audio with that of video, text and any other available modalities, obtaining much more robust results. It is not straightforward today to bring together groups working on image and audio processing, as they sometimes belong to different organizations or have simply never worked together, but this will definitely change when the benefits of doing so outweigh the difficulties.
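One simple way to combine modalities as described above is “late fusion”: each modality produces its own confidence score and the scores are merged afterwards. The sketch below is only an illustration under my own assumptions (scores already normalized to the 0-1 range, hand-picked weights, and a hypothetical function name); real systems typically learn the weights or use a trained classifier on top of the scores.

```python
from typing import Dict


def late_fusion(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-modality scores.

    Assumptions: every score lies in [0, 1], and a modality that is
    missing from `scores` (for example a video with no audio track)
    is simply left out of the average.
    """
    used = [name for name in scores if name in weights]
    total_weight = sum(weights[name] for name in used)
    if total_weight <= 0:
        raise ValueError("no usable modality with a positive weight")
    return sum(weights[name] * scores[name] for name in used) / total_weight


if __name__ == "__main__":
    weights = {"audio": 1.0, "video": 1.0, "text": 0.5}
    # The audio detector is confident, the video detector is unsure.
    print(late_fusion({"audio": 0.9, "video": 0.5}, weights))
```

The appeal of this design is that the audio and video systems can be built by separate groups and still be combined, although a tighter integration would require the kind of collaboration discussed above.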

Multimodality in the Trecvid Evaluations

This month NIST (National Institute of Standards and Technology) is organizing a set of evaluations called Trecvid to test several technologies related to video processing. The Trecvid evaluations (http://www-nlpir.nist.gov/projects/trecvid/) are long-lived yearly events that grew out of the Trec (Text REtrieval Conference) evaluations, which started in the 90’s and focus on processing text for information retrieval tasks. In 2001 a video component was introduced, and in 2003 it became its own evaluation, which is now very prominent among researchers in the video/image fields.

The Trecvid evaluation proposes a set of tasks to be carried out by the participants using a common benchmark database. Participants run their systems on this database and return a set of answers for each task to the NIST organizers, who evaluate them and later release the results and how everyone did. Results and system descriptions are then presented at a workshop that brings together all the researchers who participated. This is a perfect place to compare different technologies and new ideas on topics of interest to both industry and academia. Trecvid (like many other NIST evaluations) does not aim to be a competition, but a framework to evaluate technology fairly, under the same conditions for everyone.

Personally, I had participated in the past in three NIST evaluations called RT (Rich Transcription evaluations). In these, the data consisted of audio recordings from radio/TV broadcasts or meeting rooms, and the task was 1) to decode what was said in the recordings, and 2) to find how many people were speaking and where in the recordings each of them spoke. The RT evaluations are very well known within the speech community and have been running for a few years.

With my recent broadening of interests towards multimodal processing I became aware of the Trecvid evaluations and, in particular, of the video copy detection task, whose objective is to find video copies in a video database. This year we have been given around 400h of reference videos (of many lengths and sources, some of them in black and white and some in languages other than English). To test the systems we have been given a set of queries made of shorter videos, which may or may not contain a segment that is a transformation of a segment existing in the reference materials (the whole query can be the segment, or the segment can be just a piece of it). Many transformations are possible, with different degrees of degradation in the audio and video parts. The evaluation has 3 different deadlines. The first one (just passed) is the video-only submission, where the queries contain only the video part, with no audio. The second deadline is the audio-only submission (August 28th) and the third is the audio+video submission (1st October). This year is the first in which the audio+video analysis is a mandatory submission for all teams.
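To make the task above more concrete, here is a toy sketch of one common family of copy detection approaches: every frame is reduced to a compact fingerprint, the reference fingerprints are indexed, and each query frame “votes” for the time offset at which it would align with the reference. Everything here is an assumption for illustration: the fingerprints are plain integers standing in for real audio or video descriptors, and the function names and the `min_votes` threshold are hypothetical. It is not the method used by any particular Trecvid participant.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


def build_index(reference: Sequence[int]) -> Dict[int, List[int]]:
    """Map each fingerprint value to the reference positions where it occurs."""
    index: Dict[int, List[int]] = defaultdict(list)
    for position, fingerprint in enumerate(reference):
        index[fingerprint].append(position)
    return index


def find_copy(
    query: Sequence[int],
    index: Dict[int, List[int]],
    min_votes: int = 3,
) -> Optional[Tuple[int, int]]:
    """Return (offset, votes) for the best alignment, or None if no copy is found.

    Each matching fingerprint votes for `reference_position - query_position`.
    Frames of a true copy agree on one offset, even when only part of the
    query is copied, while accidental matches scatter over many offsets.
    """
    votes: Counter = Counter()
    for query_position, fingerprint in enumerate(query):
        for reference_position in index.get(fingerprint, ()):
            votes[reference_position - query_position] += 1
    if not votes:
        return None
    offset, count = votes.most_common(1)[0]
    return (offset, count) if count >= min_votes else None


if __name__ == "__main__":
    reference = [5, 9, 2, 7, 7, 1, 4, 8, 3, 6]
    index = build_index(reference)
    # The first query frame is unrelated; the other four copy reference[5:9].
    print(find_copy([0, 1, 4, 8, 3], index))
    # A query with no frames in common with the reference.
    print(find_copy([10, 11, 12], index))
```

In this toy setting the transformations mentioned above would show up as query fingerprints that no longer match exactly, which is why real systems put most of their effort into fingerprints that survive degradation.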

I find it very interesting that Trecvid and NIST are trying to promote research in the audio modality within this evaluation from this year on, and I hope this will lead to future years in which audio is on the same level as video, with the audio-only submission also mandatory for participating labs. I agree that this is not an easy task, as many (or most) of the participating teams are made up of video-only researchers. I think, though (and I speak from some experience), that getting into the opposite modality in one’s research (from audio into video for me, and from video into audio for most Trecvid participants) is a very enriching activity. Many new ideas come from applying well-established techniques to the new modality, ideas that have never been explored because the audio and video groups usually never mix and are sometimes even in different physical locations.

I see this much-needed fusion and understanding as similar to what I was involved in while finishing my EE studies. I knew at that time that I wanted to pursue a career as a speech engineer, so I looked for the opportunity to attend linguistics classes at universities in my city, in order to get in touch with some of the people and knowledge that I would have to work with later on when building speech recognition systems or text-to-speech applications. This was very enriching personally and professionally. As one professor used to say, I tried to “bridge the gap” between linguists and engineers.

With the audio and video communities I think we should try to do the same, and the Trecvid evals could be one point where both areas get together and discuss common problems. We can all benefit a lot from multimodality, and the technology will definitely improve dramatically when we look at the problem from orthogonal perspectives. To do so, we need many more activities where audio and video come together under the same umbrella, but we also need some help getting people from the two fields interested in each other. It is not enough to put together a European project where everyone does their own thing and nobody talks to each other. We need real collaboration where algorithms and ideas flow both ways. One good way to start would be to create real multimodal databases with high-quality annotations for both the audio and the video parts.

I am very happy working in a multimodal area and I am very glad I found Trecvid and the video copy detection task, the perfect place in which to exercise my ideas.

Christmas is here!

Dear readers,

Many months have passed since my last post. It was right in the middle of my trip to India, from which I have now fully recovered. Some day I will write a little more about my final thoughts regarding that interesting country.

Today I am enjoying my first day of vacation from work and preparing for the Christmas days by arranging my house and thinking about the presents I need to get. I am also sending a few Christmas postcards by email with a video attachment. I just discovered this new way of sending Christmas greetings from a coworker, and I must admit that it is a great idea now that almost everybody has access to broadband internet and appreciates novel ways of sending holiday greetings, other than a note, a picture or a picture with music.

I am also slowly working on my new website cover and content. For now this blog appears as the cover but as soon as I can finish it you will have access to my pictures, my publications, PhD thesis online and, of course, this blog. So keep posted.

Well, let’s keep the stuff going, merry Christmas and happy new year 2009!

Shopping pressure in India

When I got to India I thought that the guides were telling me to be careful with people trying to steal money from me and to trick me just because I was a tourist. I got here a little afraid of that, and I have found a totally different story that I want to share with you.

The pressure here from people trying to sell you a service or a good is enormous. I had never experienced people walking next to me for 5 minutes trying to sell me something, or trying to get me to sit in their rickshaw so they could take me somewhere. I also have to admit that at no time have I felt in danger of having my wallet stolen, or even felt that they were looking for it (I carry it in a secure place, nonetheless).

But let’s focus on the shopping pressure. Imagine you are walking down las Ramblas and every gift shop has 2-5 people sitting outside who jump on you, inviting you into their shop because they have very nice “something” or very cheap “whatever”. They sometimes list all the things that they can sell you (which must be ranked according to their top sellers), and if they see your face change when they mention any of them, they start dropping the price to try to attract your attention. Furthermore, if they see you looking at anything, they quickly bring it to you and start the pricing game.

My friend has the theory that their initial price can be lowered to as little as 30% of that value. Getting lower than that is difficult and will take you more time, but we have done it :) I have to say that shopping in India has become like a sport for us, in which you end up with tons of things that I probably won’t have space for in my apartment (even if it’s new, read my previous posts) and that won’t fit in the luggage.

Finally, I cannot conclude without telling you about commissions. Anywhere I have asked someone for information they have given me a self-interested answer, either directing me to a place where they would get a commission or telling me that where/what I wanted was not possible and that they had another option, which was very cheap, of course. If you go shopping and can keep these people from taking commissions, you’ll be in a better position to bargain for a good price (which will not include their commission).

A grandes rasgos

Ten days of this journey through the north of India have already gone by, and only now do I have a “relaxing” evening in Agra (home of the Taj Mahal) to write an overall impression of the trip so far.

I will possibly talk about many of the particulars in other posts, but this is meant to be an “overall impressions” entry, or “a grandes rasgos”.

When we got to India the welcome was a bit harsh. We landed on August 14th at 11PM. Given that August 15th was a national holiday and that the prime minister had to give a speech at Delhi’s Red Fort, we got a bit stuck in the airport, with people telling us that they would not be able to take us to the hotel (within the restricted area) and that the hotel had shut down for the night.

We got serious, and after 2:30 hours in a taxi we got to what seemed to be a hotel. The hotel turned out to be very nice inside, even though we had to get in through some small doors while avoiding people who were sleeping in the street.

The next day we met 2 girls from Mataró and their mother, and we embarked together on an 8-day trip through Rajasthan by taxi. For this we had the company of 2 drivers who spoke hardly any English (contrary to what the tourist office had told us).

We have just finished our tour and landed in Agra, which has by far the hottest climate so far. Today my companion and I suffered a bit of heat stroke as we walked through the Taj Mahal area.

Tomorrow we are heading towards Delhi, but just for a pit stop on our way to Rishikesh, town of yoga and ayurvedic massages. I’ll tell you more about that later.

De paseo por… India

Hi friends,

This is about to start… in two days I’ll be heading off to India for 15 days of vacation. I have been looking forward to it for a couple of months now, and although one of the three musketeers unfortunately cannot make it, we are still in a good mood and looking forward to Ankara, Delhi, and a culture shock.

In fact, so that the trip is just a culture shock and not an illness shock, I finally took myself down to the vaccination center in Barcelona. I am no friend of needles and I had been putting off the trip to the doctor (even considering not going at all), until today, when I was talking to a friend at work who convinced me that this was not a very good idea.

I am happy to have gone and, in fact, I left with 3 shots and no pain at all (except for some pain in my wallet). Apart from some time waiting, I got excellent attention from very professional people who answered all my questions and offered me more information than I had asked for about which medicines to buy, and what to do and not to do over there.

So I would just say: listen to your friends all the time, but leave health issues to the professionals. They are the only ones who can really tell you what is right for you.

My new flat

Dear reader,

It’s been a while (over 6 months) from my first post until this one, and I felt that enough was enough, that I need to focus and write a bit more. One of the reasons why I have not been very prolific in writing is that I just got my new flat, which took some time to find and has taken (less) time to settle into.

Now I can say that I am an owner! Of a mortgage, at least.

I will leave the description and some pictures of the flat for another time. Now I would like to talk about the “moving in” issue. I call it an issue because it is not an easy thing, especially when the flat is not a new one (like mine) and you have a bunch of things to move from an old flat to this one.

In my case the actual “moving” was pretty fast. I had to be out of my old 30m2 rented apartment by July 31st, and I only got to sign the mortgage and get the keys on the afternoon of the 30th. My girlfriend and I started carrying boxes that evening, and we finished the big stuff with my dad on the morning of the 31st. We can definitely say that everything was finished by the afternoon of the 31st, as I needed to leave for Valencia, where I had to be on the 1st :) So we carried a whole bunch of boxes into the living room in less than 24 hours, with 3 people and lots and lots of sweat… and a parking ticket.

Getting all these boxes into their places is not as easy as I thought. First you need to make sure that the place where you’re going to put something is clean, which sometimes turned out not to be the case, so some time goes into getting that stain or dust out of there. I am allergic to dust, which makes it even funnier, as I could not stop sneezing during all the days I have spent putting things in order…

Once the “container” is clean, it is time to put the contents in. That is no easy job either, as the new and old flats are not the same size (one is 3.5x the size of the other). There is normally a lot of thinking about where I want my things stored, which is then discussed item by item with the loved ones (in my case it was not so bad).

Finally, today, I can say that there are no more full boxes in my living room (there are some empty ones, waiting for a trip to the recycling container downstairs). The tough part is over, but now another part starts: the rest of the cleaning, and the improvements. I have many ideas for things that I can put in the flat to make it more livable, and there is still a lot to clean (including the living room floor), which will certainly be addressed next.

One thing is true: it makes a big difference to clean and put things in order in an apartment that is yours rather than in one that is just rented. I have spent the last 10 years renting places, and it feels more important now.

Overall, I love my new flat!  

This is cool

Hi reader!

I would never have promised anyone that I would be starting a blog this year. Indeed, this was not one of my New Year’s resolutions, but here I am :)

Lately I have discovered how important it is to get our individual word out. As insignificant as it might seem, it can help someone, and it can also help me. In fact, by writing things down, or saying them out loud, I get them clearer in my mind.

This blog will not be about technical stuff only. I am a EE/Telecos and I work on speech processing, but not all in life is work, or is it?

The blog will not be about politics either. I am not very political myself, so although I might sometimes refer to something that caught my attention, I will only take the side that aligns with what I think, and not with what any political party says.

Also, the blog is not intended to be about traveling (because I am mostly stationary right now), or food (I like to eat it, not to talk about it), or culture, or the economy, or…

So, what can I talk about? A little bit of everything and a lot of nothing. This is why I titled it “Estoy de paseo”, which means that I am taking a walk around many topics, whichever ones catch my attention and that I feel like writing a bit about.

Ah, and who am I? I am Xavi, living in Barcelona (Spain) for now, working for a big telco company and currently single.