Meetings Domain Experiments Setup

When comparing the results of new speech-related algorithms it is common to face some sort of ``flakiness''. This term started being used for speaker diarization during the RT04f workshop (NIST Fall Rich Transcription Evaluation website, 2006) to account for two phenomena that were common to all diarization systems presented in that evaluation: the large variance of the scores among the evaluated shows, and the extreme sensitivity of the scores to small modifications of the tuning parameters.
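As a rough illustration of these two phenomena, the following minimal sketch measures the spread of an error score across shows and the largest jump in the average score between neighbouring values of a tuning parameter. The function names, show names, parameter values and scores are all hypothetical and chosen only for illustration; they are not results from any evaluation.

```python
from statistics import mean, pstdev


def per_show_spread(scores):
    """Return (mean, standard deviation, range) of per-show error scores.

    `scores` maps a show name to its error score (in percent).
    """
    values = list(scores.values())
    return mean(values), pstdev(values), max(values) - min(values)


def parameter_sensitivity(scores_by_param):
    """Return the largest change in the average score between adjacent
    values of a tuning parameter.

    `scores_by_param` maps a parameter value to a per-show score dict.
    """
    params = sorted(scores_by_param)
    averages = [mean(scores_by_param[p].values()) for p in params]
    return max(abs(b - a) for a, b in zip(averages, averages[1:]))


# Purely illustrative numbers (hypothetical shows and parameter values).
example = {
    1.0: {"show_a": 10.0, "show_b": 30.0},
    1.1: {"show_a": 25.0, "show_b": 35.0},
}

avg, std, rng = per_show_spread(example[1.0])
jump = parameter_sensitivity(example)
```

A large standard deviation or range in the first measure corresponds to the variance among shows, while a large jump in the second corresponds to the sensitivity to tuning parameters.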

As in other disciplines within speech technologies, when comparing the performance of new algorithms against a baseline, the choice of baseline system, databases and test conditions makes a difference, since it determines where the proposed algorithms appear to perform best. In many cases, due to flakiness, testing the same algorithms with two different databases or baseline systems leads to two very different results, one supporting the validity of the proposed algorithm and the other not.

In order to run meaningful and fair experiments using the algorithms proposed in this thesis one needs to define:

A baseline system, which serves as the point of comparison for all systems proposed and tested.
Common development and test datasets, based on the NIST RT evaluation datasets, so that results are comparable between experiments and with systems outside of this thesis.
A set of metrics to evaluate such systems using commonly used and available techniques.

In the following subsections each of these items is described as it has been used in this thesis for most of the experiments with the system's main blocks.

user 2008-12-08