Speaker segmentation is sometimes referred to as speaker change detection and is closely related to acoustic change detection. Given a speech/audio stream, speaker segmentation (or change detection) systems find the times at which the speaker changes. On a more general level, acoustic change detection aims at finding the times at which the acoustics of the recording change, which includes transitions such as speech/non-speech and music/speech, among others. Acoustic change detection can therefore also detect boundaries within a speaker turn when the background conditions change.
The term ``speaker segmentation'' has sometimes been used, erroneously, in place of speaker diarization for systems that perform both a segmentation into different speaker segments and a clustering of those segments into homogeneous groups. As will be pointed out later on, many systems obtain a speaker diarization output by first performing a speaker segmentation and then grouping the segments that belong to the same speaker. In other cases the distinction is less clear, as segmentation and clustering are mixed together. In this thesis a system will be said to perform speaker segmentation when all frames assigned to any particular speaker ID are contiguous in time. Otherwise the system will be said to perform speaker segmentation and clustering (or, equivalently, speaker diarization).
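The criterion above can be checked mechanically on a frame-level output. The following is a minimal sketch, assuming the system output is available as a sequence with one speaker ID per frame; the function name and the label format are hypothetical and are used only for illustration.

```python
def is_pure_segmentation(frame_labels):
    """Return True if every speaker ID covers a single contiguous run of frames.

    `frame_labels` is assumed to hold one speaker ID per frame, in time order.
    If any ID reappears after a different ID has been seen, the output also
    implies a clustering step (i.e. it is a diarization output).
    """
    seen = set()
    previous = object()  # sentinel that never equals a real label
    for label in frame_labels:
        if label != previous:
            if label in seen:
                # The same ID starts a second, non-adjacent run of frames.
                return False
            seen.add(label)
            previous = label
    return True


# Three speakers, each in one contiguous block: segmentation only.
print(is_pure_segmentation(["A", "A", "B", "B", "C"]))  # True
# Speaker "A" returns after "B": segmentation and clustering.
print(is_pure_segmentation(["A", "B", "B", "A"]))       # False
```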
On a very general level, two main types of speaker segmentation systems can be found in the literature. The first kind comprises systems that perform a single processing pass over the acoustic data, from which the change-points are obtained. The second broad class comprises systems that perform multiple passes, refining the change-point decisions on successive iterations. This second class includes two-pass algorithms, where a first pass suggests many change-points (more than actually exist, and therefore with a high false alarm rate) and a second pass reevaluates these candidates and discards some of them. Also part of the second class are systems that use some form of iterative processing to converge to an optimal speaker segmentation output. Many of the change-point detection algorithms reviewed in this section (including all of the metric-based techniques) can either work alone or in a two-step system together with another technique.
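The two-pass idea can be illustrated with a small sketch. It is not taken from any of the cited systems: it assumes scalar (one-dimensional) features and uses the absolute difference between segment means as a stand-in for a real distance metric. The function names, the window size and both thresholds are hypothetical.

```python
from statistics import fmean


def first_pass(features, window, loose_threshold):
    """Over-generate candidates: peaks of the distance between adjacent windows."""
    scores = {}
    for t in range(window, len(features) - window + 1):
        left = fmean(features[t - window:t])
        right = fmean(features[t:t + window])
        scores[t] = abs(left - right)
    candidates = []
    for t, score in scores.items():
        # Keep local maxima of the distance curve that exceed a loose threshold.
        is_peak = score >= scores.get(t - 1, 0.0) and score > scores.get(t + 1, 0.0)
        if is_peak and score > loose_threshold:
            candidates.append(t)
    return candidates


def second_pass(features, candidates, strict_threshold):
    """Reevaluate each candidate on the full segments around it; drop weak ones."""
    kept = list(candidates)
    changed = True
    while changed:
        changed = False
        bounds = [0] + kept + [len(features)]
        for i in range(1, len(bounds) - 1):
            left = fmean(features[bounds[i - 1]:bounds[i]])
            right = fmean(features[bounds[i]:bounds[i + 1]])
            if abs(left - right) < strict_threshold:
                # Merge the two segments and restart with the new boundaries.
                kept.remove(bounds[i])
                changed = True
                break
    return kept


# Toy signal: a small shift at frame 8 and a large one at frame 16.
features = [0.0] * 8 + [1.0] * 8 + [6.0] * 8
candidates = first_pass(features, window=4, loose_threshold=0.5)
print(candidates)                                                   # [8, 16]
print(second_pass(features, candidates, strict_threshold=2.0))      # [16]
```

In this toy case the first pass proposes both boundaries, and the second pass discards the weaker one at frame 8 because the segments on either side of it are too similar under the stricter threshold.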
On another level, a general classification of the available speaker segmentation methods is used in this section to describe the different algorithms. In the literature (Ajmera (2004), Kemp et al. (2000), Chen et al. (2002), Shaobing Chen and Gopalakrishnan (1998), Perez-Freire and Garcia-Mateo (2004)) three groups are defined: metric-based, silence-based and model-based algorithms. In this thesis this classification is augmented with a fourth group (called ``others'') that gathers all techniques that do not fit any of the three proposed classes. In the next section the metric-based techniques are reviewed in detail, and the other three groups are treated in 2.2.2.