The proposed initialization algorithm, called friends-and-enemies initialization, is designed to split the acoustic data to be processed into clusters, where is determined beforehand by the algorithm presented in 4.2.2 or manually set by the user. In the agglomerative clustering scheme presented for the meeting domain, it corresponds to the initial number of clusters used to start the agglomerative process. Each of the resulting initial clusters has a duration which is not restricted to be equal to any other cluster.
The complete initialization is composed of three distinct blocks, as shown in Figure 4.6. The first block performs a speaker-change detection on the acoustic data to identify segments with a high probability of containing only one acoustic event. Such acoustic events can either be silence, various noises, an individual speaker or various speakers overlapping each other. This first step is performed using the modified Bayesian Information Criterion (BIC) metric (introduced by Ajmera and Wooters (2003)) computed between two models created from the data in two adjacent windows of size , connected at the evaluated possible change point. The modified BIC metric is computed over all the acoustic data every frames. A possible change point is selected if , it corresponds to a local minimum of the BIC values around it, and there is no other possible change point with smaller BIC value which is closer than frames to it. In the implementation second windows are used, with a scroll seconds. Each window is modeled using a model with 5 Gaussian mixtures (therefore with 10 Gaussians for the combined model) and a of 3 seconds, equal to the minimum speaker turn duration used in the following agglomerative-clustering process.
The second block in the initialization algorithm creates clusters by identifying the segments defined in the first part as friends or enemies of each other. It is considered that two given acoustic segments are friends if they contain acoustically homogeneous data; only the best friends are brought together to form a cluster. In the same way, it is considered that two segments are enemies if they contain very dissimilar acoustic data. It is intended to obtain final enemy groups (the desired final number of clusters) consisting of segments each, which are friends of each other. Three different similarity metrics were experimented with to compare each segment pair and . On the first place a geometric mean of the frame cross-likelihood as
where is the number of frames in segment , and is a model with 5 Gaussian Mixtures trained with .
The second metric normalizes each term by the number of frames in the linear domain instead, resulting in a penalty to the cross-likelihood as
The third metric does a full cross-likelihood as introduced by Rabiner in Juang and Rabiner (1985)
All three metrics are bigger the closest the segments are to each other. In order to initiate the process one needs to define an initial segment. Again, three criteria have been considered:
Figure 4.6 shows an example case on how the algorithm works. In the horizontal axis the speaker segments as found by the first block, are represented. The vertical axis shows the distance value associated to each segment. In step (0) the initial segment needs to be determined. In this example the criterion 2 is used to find the segment with smallest averaged likelihood, .
Then, in step (1a) the data in is used to train a model with 5 Gaussian mixtures ( ) and compute either metric between itself and all other segments. The segments with bigger value are its friends. In this example, and the selected friends for are chosen to be and . In step (1b), a new model is trained from all data in this first cluster ( ) and the same metric as before is computed, except that now it is measured between all segments in the model and each of the remaining segments.
A new enemy is selected as the segment with smaller value to the first cluster. Also in the same way, in step (2a) friends are chosen for and in (2b) we select a new enemy for both previously established clusters. This is done by computing the sum of the used metric for each segment given all predefined groups. The processing continues until the desired number of initial clustering is reached or it runs out of free segments.
At that point in the third block all created models are used to reassign the acoustic data into the classes. This is done using a Viterbi decoding where the resulting segmentation is not constrained to the predefined speaker changes, therefore any previous speaker change detection errors can be corrected. All data gets assigned to its closest cluster, classifying any acoustic frames not assigned in the previous block. Finally, one cluster model is trained from each of the resulting clusters.