These techniques detect speaker changes under the hypothesis that most changes between speakers occur across a silence segment. They have traditionally been used to produce segments for speech recognition, where it is very important to obtain clean speaker changes without cutting any words in half. Systems in this category are either energy-based or decoder-based.
Energy-based systems use an energy detector to find the points where a speaker change is most likely. The detector normally produces a curve with minimum/maximum points at potential silences, and a threshold is usually applied to select them (Kemp et al. (2000), Wactlar et al. (1996), Nishida and Kawahara (2003)). In Siu et al. (1992), the MAD (mean absolute deviation) statistic, which measures the variability in energy within segments, is used instead to find the silence points.
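As an illustration of the thresholding idea only, the following minimal sketch computes a short-time log-energy curve and marks the centre of every run of low-energy frames as a candidate change point. It is not the method of any of the cited systems: the function names, the frame length, the hop and the absolute threshold in dB are all assumptions chosen for the example.

```python
import math


def frame_energies_db(samples, frame_len, hop):
    """Short-time log energy (in dB) of each full frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        # The small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(power + 1e-12))
    return energies


def silence_candidates(energies_db, threshold_db):
    """Centre frame index of every run of consecutive frames below threshold_db.

    Each returned index is only a candidate speaker change point.
    """
    candidates = []
    run_start = None
    # A sentinel of +inf closes a low-energy run that reaches the last frame.
    for i, energy in enumerate(energies_db + [float("inf")]):
        if energy < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            candidates.append((run_start + i - 1) // 2)
            run_start = None
    return candidates


# Toy signal (assumed): loud, then quiet, then loud again.
signal = [1.0, -1.0] * 50 + [0.001, -0.001] * 25 + [1.0, -1.0] * 50
energies = frame_energies_db(signal, frame_len=10, hop=10)
print(silence_candidates(energies, threshold_db=-30.0))  # prints [12]
```

In practice the threshold would more likely be set relative to the energy statistics of the recording than fixed in absolute terms; the fixed value here only keeps the sketch short.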
In contrast, decoder-guided segmenters run a full recognition system and obtain the change points from the detected silence locations (Kubala et al. (1997), Woodland et al. (1997), Lopez and Ellis (2000b), Liu and Kubala (1999), Wegmann et al. (1998)). They normally constrain the minimum duration of the silence segments to reduce false alarms. Some of these systems use extra information from the decoder, such as gender labels (Tranter and Reynolds, 2004) or wide/narrow band plus music detectors (Hain et al., 1998). The output has normally been used as input to recognition systems, but not for indexing or diarization, as there is no clear relationship between the existence of a silence in a recording and a change of speaker. When indexing or diarization systems do use these points, they sometimes take them as hypothesized speaker change points and then apply other techniques to decide which of them actually mark a change of speaker and which do not.
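The minimum-duration constraint can be sketched as a simple filter over the decoder output. The token format used here, a list of (label, start, end) tuples in seconds with a `<sil>` label for silence, is a hypothetical one chosen for the example, as are the function name and the 0.3 s default; real decoders differ in how they report silences.

```python
def change_points_from_decoder(tokens, min_silence_s=0.3, silence_label="<sil>"):
    """Hypothesized change points from time-aligned decoder output.

    tokens: iterable of (label, start_s, end_s) tuples (assumed format).
    Only silences lasting at least min_silence_s are kept, which discards
    short pauses that would otherwise raise false alarms. The midpoint of
    each kept silence is returned as a candidate change point.
    """
    points = []
    for label, start, end in tokens:
        if label == silence_label and (end - start) >= min_silence_s:
            points.append((start + end) / 2.0)
    return points


# Assumed toy decoder output: one short pause and one long silence.
decoded = [
    ("hello", 0.0, 0.4),
    ("<sil>", 0.4, 0.5),   # 0.1 s pause, rejected
    ("world", 0.5, 1.0),
    ("<sil>", 1.0, 2.0),   # 1.0 s silence, kept
    ("again", 2.0, 2.5),
]
print(change_points_from_decoder(decoded))  # prints [1.5]
```

The points returned are only hypotheses: as noted above, a silence does not by itself imply a change of speaker, so a later stage would still have to accept or reject each one.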