NIST Rich Transcription Evaluations in Speaker Diarization for Meetings

The Rich Transcription evaluations conducted by NIST began with RT02s in 2002; the latest one is RT06s. According to NIST (Spring 2005 (RT-05S) Rich Transcription Meeting Recognition Evaluation Plan (n.d.), Spring 2006 (RT-06S) Rich Transcription Meeting Recognition Evaluation Plan (n.d.)), the Rich Transcription (RT) of a spoken document addresses the need for information other than the set of words that have been said (extracted with a Speech-to-Text, STT, system). A transcription of only the words spoken in a recording rarely captures all the information that the speakers tried to convey. This is because spoken language is much more than just the spoken words; it also carries information about the speakers, prosodic cues, intent, and much more.

The goal of future RT systems is to produce transcripts enriched with metadata, so that the user can fully understand the content of an audio recording without listening to it. In the recent RT evaluations, NIST has focused on three core technologies that are important elements of the metadata content: Speech-to-Text (STT), Speaker Diarization (SPKR) and Speech Activity Detection (SAD). In the last two years (RT05s and RT06s), the evaluations have focused on the meetings domain.


user 2008-12-08