Multimodality in the Trecvid Evaluations

This month NIST (the National Institute of Standards and Technology) is running Trecvid, a set of evaluations designed to test several technologies related to video processing. The Trecvid evaluations are long-running yearly events that grew out of the Trec (Text REtrieval Conference) evaluations, which started in the 90s and focus on text processing for information retrieval tasks. A video track was introduced in 2001, and in 2003 it became an evaluation in its own right, one that is now very prominent among researchers in the video and image fields.

The Trecvid evaluation proposes a set of tasks that participants carry out on a common benchmark database. Participants run their systems on this data and return their answers for each task to the NIST organizers, who score them and later release the results. Results and system descriptions are then presented at a workshop that brings together all the participating researchers. This is a perfect place to compare different technologies and new ideas on topics of interest to both industry and academia. Trecvid (like many other NIST evaluations) does not aim to be a competition, but a framework for evaluating technology fairly, under the same conditions for everyone.

Personally, I have participated in three past NIST evaluations called RT (Rich Transcription). In these, the data consisted of audio recordings from radio/TV broadcasts or meeting rooms, and the tasks were 1) to transcribe what was said in the recordings, and 2) to determine how many people were speaking and when each of them spoke. The RT evaluations are very well known within the speech community and have been running for a few years.

With my recent broadening of interests towards multimodal processing I became aware of the Trecvid evaluations and, in particular, of the video copy detection task, whose objective is to find video copies in a video database. This year we have been given around 400 hours of reference videos (of many lengths and sources, some in black and white and some in languages other than English). To test the systems we have been given a set of queries, which are shorter videos that may or may not contain a transformed copy of a segment from the reference material; the copied segment can span the whole query or just a piece of it. Many transformations are possible, with different degrees of degradation in the audio and video parts. The evaluation has three deadlines. The first (just passed) is the video-only submission, where the queries contain only the video part, with no audio. The second is the audio-only submission (August 28th) and the third is the audio+video submission (October 1st). This year is the first in which the audio+video analysis is a mandatory submission for all teams.

I find it very interesting that Trecvid and NIST are trying to promote research in the audio modality within this evaluation from this year on, and I hope this will lead to future editions where audio is on the same level as video, with the audio-only submission also mandatory for participating labs. I agree that this is not an easy task, as many (or most) of the participating teams are composed of video-only researchers. I think, though (and I speak from some experience), that moving into the opposite modality in one's research (from audio into video for me, and from video into audio for most Trecvid participants) is a very enriching activity. Many new ideas come from applying well-established techniques to the new modality, ideas that have never been explored because the audio and video groups usually do not mix and are sometimes even in different physical locations.

I see this needed fusion and mutual understanding as similar to an experience I had while finishing my EE studies. I knew at that time that I wanted to pursue a career as a speech engineer, so I looked for opportunities to attend linguistics classes at universities in my city, in order to get in touch with some of the people and knowledge I would later work with when building speech recognition systems or text-to-speech applications. This was very enriching personally and professionally. As one professor used to say, I tried to “bridge the gap” between linguists and engineers.

With the audio and video communities I think we should try to do the same, and the Trecvid evaluations could be one place where both areas get together and discuss common problems. We can all benefit a lot from multimodality, and the technology will surely improve dramatically when we look at the problem from orthogonal perspectives. To do so, we need many more activities that bring audio and video together under the same umbrella, but we also need some help getting people from the two fields interested in each other. It is not enough to put together a European project where each partner does their own thing and nobody talks to anyone else. We need real collaboration where algorithms and ideas flow both ways. One good way to start would be to create truly multimodal databases with high-quality annotations for both the audio and the video parts.

I am very happy working in a multimodal area, and I am very glad I found Trecvid and the video copy detection task, a perfect place to exercise my ideas.
