The National Institute of Standards and Technology (NIST) (National Institute for Standards and Technology, 2006) has organized evaluations of many aspects of speech technology over the years. Its speaker diarization evaluations started in 2000 and have covered telephone speech (2000, 2001, 2002), broadcast news (2002, 2003, 2004) and meetings (2002, 2004, 2005, 2006). In the last two years, the focus has been exclusively on the meetings environment.
The datasets used in the meetings evaluations were hand-transcribed by LDC. This acoustic data constitutes the basis for the development and evaluation of the algorithms proposed in this thesis. Initially, in 2002, the speaker segmentation task was included in the speaker recognition evaluation (SRE-02) and used data from the NIST meeting room research project. From 2004 to 2006, speaker diarization was instead part of the Rich Transcription (RT) evaluations (RT04s, RT05s and RT06s), alongside the speech-to-text (STT) evaluation on meetings data. The datasets used for these evaluations contain data from CMU, ICSI, LDC, NIST, CHIL and AMI.
The following sections explain the main ideas behind the systems submitted to each of the NIST meetings evaluations, together with the algorithms that were developed specifically for processing meetings data.