Initial models (for example, GMMs) are trained from training data for a closed set of acoustic classes (telephone/wideband, male/female, music/speech/silence, and combinations of these). The audio stream is then classified by Maximum Likelihood (ML) selection among these models (Gauvain et al. (1998), Kemp et al. (2000), Bakis et al. (1997), Sankar et al. (1998), Kubala et al. (1997)). The points where the selected model changes become the segmentation change points. Decoder-guided systems could also be considered model-based, since they model each phoneme and silence. The systems discussed here, however, try to distinguish among broader acoustic classes rather than relying on models derived from speech recognition and trained for individual phones.
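The classification step described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any cited system: it assumes one-dimensional features, hypothetical pretrained models given as lists of (weight, mean, variance) components, and a fixed non-overlapping decision window. The names gmm_loglik, classify_windows and change_points, and all numeric values, are invented for the example.

```python
import math

def gmm_loglik(x, gmm):
    """Log-likelihood of a scalar feature x under a 1-D GMM [(weight, mean, var), ...]."""
    comps = [
        math.log(w) - 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for w, mean, var in gmm
    ]
    m = max(comps)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(c - m) for c in comps))

def classify_windows(frames, models, win):
    """ML selection: label each non-overlapping window with the best-scoring class."""
    labels = []
    for start in range(0, len(frames), win):
        chunk = frames[start:start + win]
        scores = {name: sum(gmm_loglik(x, g) for x in chunk)
                  for name, g in models.items()}
        labels.append(max(scores, key=scores.get))
    return labels

def change_points(labels, win):
    """Frame indices where the winning class differs between consecutive windows."""
    return [i * win for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

# Hypothetical pretrained class models (toy values chosen for illustration only).
models = {
    "speech": [(0.5, 0.0, 1.0), (0.5, 1.0, 1.0)],
    "music": [(1.0, 6.0, 1.0)],
}
frames = [0.1, 0.8, -0.3, 0.5, 6.2, 5.7, 6.1, 5.9, 0.2, 0.9, 0.4, -0.1]
labels = classify_windows(frames, models, win=4)
points = change_points(labels, win=4)
```

With these toy values the three windows are labelled speech, music, speech, so the change points fall at frames 4 and 8. Real systems would use multi-dimensional cepstral features and models estimated from data; the window length here is an assumption.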
This segmentation method closely resembles speaker clustering techniques in which the identities of the different speakers (here, the acoustic classes) are known a priori and an ML segmentation is found. Both areas suffer from a robustness problem, since they require initial data to train the models. As will be shown in the speaker clustering section, recent years have seen research on blind speaker clustering, where no initial information about the clusters is available. Some of this research applies these techniques to speaker segmentation. In particular, some clustering systems use an ML decoding of evolutive models that searches for the optimum acoustic change points and speaker models at the same time.
In Ajmera et al. (2002) and Ajmera and Wooters (2003) the iterative decoding is done bottom-up: it starts with a large number of speaker changes, produced by a first processing step, and eliminates them until the optimum number is reached. In Meignier et al. (2001) and Anguera and Hernando (2004a) it is done top-down: it starts with a single segment and adds segments until the final number is reached.
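The bottom-up direction can be illustrated with a toy sketch. It is not the procedure of the cited systems: the distance between segment means used here is a deliberately simple stand-in for a merging criterion, and the function name bottom_up_merge, the threshold and the data are all hypothetical.

```python
def bottom_up_merge(frames, boundaries, threshold):
    """Start from an over-segmentation and repeatedly drop the weakest change point.

    boundaries: distinct frame indices strictly between 0 and len(frames).
    A change point is removed while the means of its two neighbouring
    segments differ by less than `threshold` (stand-in criterion).
    """
    bounds = sorted(boundaries)
    while bounds:
        edges = [0] + bounds + [len(frames)]
        means = [sum(frames[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]
        gaps = [abs(means[i + 1] - means[i]) for i in range(len(bounds))]
        weakest = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[weakest] >= threshold:
            break  # every remaining change point separates distinct segments
        del bounds[weakest]
    return bounds

frames = [0.1, 0.8, -0.3, 0.5, 6.2, 5.7, 6.1, 5.9]
kept = bottom_up_merge(frames, boundaries=[2, 4, 6], threshold=1.0)
```

Starting from three candidate change points, the sketch keeps only the one at frame 4. A top-down variant would run in the opposite direction, starting from an empty boundary list and inserting change points until a stopping criterion is met.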
Meignier et al. (2004) analyzed evolutive systems in which pretrained models are also used to model background conditions, showing that, in general, the more prior information the system is given, the better it performs.
All of these systems use Gaussian Mixture Models (GMMs) to model the different classes and an ML/Viterbi decoding approach to obtain the optimum change points. Lu et al. (2001) instead use Support Vector Machines (SVMs), trained on pre-labelled data, as the classifier, replacing both the GMMs and the ML decoding.
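To make the ML/Viterbi decoding step concrete, the following is a minimal sketch under stated assumptions: per-frame class log-likelihoods are already available (for instance from class GMMs), and every change of class costs a fixed switch_penalty. The penalty, the function name viterbi_segment and the numbers are illustrative choices, not taken from the cited papers.

```python
def viterbi_segment(loglik, switch_penalty):
    """Best class sequence for per-frame log-likelihoods with a fixed switching cost.

    loglik: list of dicts {class_name: log-likelihood}, one dict per frame.
    """
    classes = list(loglik[0])
    score = {c: loglik[0][c] for c in classes}
    back = []  # back[t][c] = best predecessor of class c at frame t + 1
    for frame in loglik[1:]:
        new_score, pointers = {}, {}
        for c in classes:
            # Staying in the same class is free; switching pays the penalty.
            prev = max(classes,
                       key=lambda p: score[p] - (0.0 if p == c else switch_penalty))
            cost = 0.0 if prev == c else switch_penalty
            new_score[c] = score[prev] - cost + frame[c]
            pointers[c] = prev
        back.append(pointers)
        score = new_score
    # Backtrack from the best final class.
    path = [max(classes, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy log-likelihoods: frame 2 slightly favours "b" in isolation.
loglik = [
    {"a": -1.0, "b": -3.0}, {"a": -1.0, "b": -3.0}, {"a": -2.0, "b": -1.5},
    {"a": -1.0, "b": -3.0}, {"a": -3.0, "b": -1.0}, {"a": -3.0, "b": -1.0},
    {"a": -3.0, "b": -1.0},
]
path = viterbi_segment(loglik, switch_penalty=2.5)
framewise = [max(f, key=f.get) for f in loglik]
```

Frame-by-frame ML selection labels frame 2 as "b", producing two spurious change points, whereas the decoded path is a, a, a, a, b, b, b with a single change point at frame 4. This smoothing effect is the usual motivation for decoding over the whole sequence rather than classifying each frame independently.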
user 2008-12-08