Speaker Clusters and Models Initialization

To initialize the hierarchical bottom-up agglomerative clustering, one first needs to define an initial number of clusters $ K_{init}$, larger than the optimum number of clusters $ K_{opt}$. The system defined for broadcast news used $ K_{init}=40$ clusters, a value chosen empirically on development data. It was found that, even though the optimum number of clusters in a recording is independent of the length of the recording, the total length of the available data has to be considered when selecting the initial number of clusters for the agglomerative system, so that the clusters are well trained and best represent the speakers. Keeping $ K_{init}$ constant for any kind of data causes some recordings to perform worse, since the initial models contain either too much or too little acoustic data. In the system presented here for meetings, this initial number is made dependent on the amount of data remaining after speech/non-speech detection. A new parameter called Cluster Complexity Ratio (CCR) represents the relationship between the amount of data and the cluster complexity. The algorithm is described in detail in section 4.2.2.

The same CCR parameter is also used throughout the agglomerative clustering process to determine the complexity (number of Gaussian mixtures) of the speaker models. This mechanism ensures that all models keep a complexity proportional to the amount of data they are trained with, and therefore remain comparable to each other. This is further explained in section 4.2.2.

Given the data assigned to each cluster, the technique used in the baseline system to obtain an initial GMM of a certain complexity has been replaced by another one that yields better initialized models. Experiments showed that the initial models play an important role in the overall performance of the system, since the initial position of the mixtures strongly affects how well the model can be trained using EM-ML and therefore how representative of the data it will be. This is particularly crucial in speaker diarization, where small models (initially 5 Gaussians) are used because little training data is available.

The broadcast news system uses a method that resembles the HCompV routine in the HTK toolkit (Young et al., 2005) for initialization without a reference transcription. Given a set of acoustic vectors $ X = \{x[1] \dots x[N]\}$ and a desired GMM with complexity M Gaussians, the first Gaussian is computed via the sufficient statistics of the data $ X$ as

$\displaystyle \mu_{1}=\frac{1}{size(X)}\sum_{i=1}^{N} x[i]$

$\displaystyle \sigma^{2}_{1} = \frac{1}{M} \left(\frac{1}{N} \sum_{i=1}^{N} x^{2}[i] - \mu_{1}^{2}\right)$

For the rest of the Gaussian mixtures, equidistant points in $ X$ are chosen as means and the same variance as in Gaussian 1 is used:

$\displaystyle \mu_{i} = X[i \cdot \frac{N}{M} ]$

$\displaystyle \sigma^{2}_{i} = \sigma^{2}_{1}$

with Gaussian weights kept equal for all mixtures, $ W_{i} = \frac{1}{M}$.
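The baseline initialization can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code of the broadcast news system: the function name `baseline_init` is hypothetical, covariances are assumed diagonal, the variance is assumed to be the usual second moment minus the squared mean (scaled by $ 1/M$), and the index $ i \cdot \frac{N}{M}$ is assumed to be rounded down and read as a 1-based position in $ X$.

```python
import numpy as np

def baseline_init(X, M):
    """Sketch of the equidistant-point GMM initialization.

    X: (N, D) array of acoustic vectors, M: desired number of Gaussians.
    Returns (weights, means, variances) with diagonal variances.
    """
    N, D = X.shape
    # First Gaussian: global mean and (1/M-scaled) variance of the data.
    mu1 = X.mean(axis=0)
    var1 = (np.mean(X ** 2, axis=0) - mu1 ** 2) / M

    means = np.empty((M, D))
    means[0] = mu1
    # Gaussians 2..M: means taken at equidistant positions in X.
    # Assumption: i*N/M is floored and treated as a 1-based index.
    for i in range(2, M + 1):
        means[i - 1] = X[(i * N) // M - 1]

    variances = np.tile(var1, (M, 1))   # same variance for every Gaussian
    weights = np.full(M, 1.0 / M)       # equal weights W_i = 1/M
    return weights, means, variances
```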

This method has two obvious drawbacks. On the one hand, as pointed out above, this technique does not follow a global ML approach, and therefore the Gaussian mixtures can easily end up in local maxima. On the other hand, it does not ensure that the positioned Gaussians cover the whole acoustic space of the data.

Figure 3.7: Speaker models initialization based on Gaussian splitting

The technique introduced here is inspired by the split and vanish techniques used in the GMTK toolkit (Bilmes and Zweig, 2002) and by the mixture incrementing function in HTK. As seen in figure 3.7, the initial mean and variance of the data $ X$ are computed in the same way as in the previous technique (step 1). Then the algorithm iteratively splits each of the $ M'_{prev}$ Gaussian mixtures into two mixtures, obtaining a total of $ M'_{new}$ mixtures, while $ 2M'_{new}<M$, where $ M$ is the desired model complexity. The $ M'_{new}$ Gaussian mixtures are computed from their previous counterparts by

$\displaystyle \sigma^{2}_{new1} = \sigma^{2}_{new2} = \sigma^{2}_{prev}$

$\displaystyle \mu_{new1}=\mu_{prev} + 0.2\sigma_{prev} \ \ \ \mu_{new2}=\mu_{prev} - 0.2\sigma_{prev}$

$\displaystyle W_{new1} = W_{new2} = \frac{W_{prev}}{2}$

After each split, a single EM training step of the current model on the data $ X$ is performed to allow the Gaussian mixtures to adapt their means and variances to the data.

Once an additional splitting iteration would exceed the desired number of Gaussian mixtures, the algorithm moves into a single-Gaussian split mode (step 3). In this mode, the Gaussian selected for splitting is the one with the highest weight, and it is split in the same way as shown before. Some experiments were performed with alternative splitting/vanishing procedures, but when initializing GMM models with a small number of Gaussian mixtures, performance was seen to degrade whenever vanishing was applied; the technique applied here therefore uses only a splitting procedure. Also, the defunct function implemented in HTK to discard Gaussians with low weight was found to be detrimental to the GMM models grown here.
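The splitting procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the system: the names `em_step`, `split` and `split_init` are hypothetical, covariances are assumed diagonal, the initial variance is assumed to be the plain data variance, the doubling phase is assumed to continue as long as doubling does not exceed $ M$, and a single EM step is assumed to follow every split, including those of step 3.

```python
import numpy as np

def em_step(X, w, mu, var, var_floor=1e-6):
    """One EM iteration for a diagonal-covariance GMM."""
    # E-step: log of weighted component densities, shape (N, M).
    diff = X[:, None, :] - mu[None, :, :]
    log_p = -0.5 * (np.sum(diff ** 2 / var, axis=2)
                    + np.sum(np.log(2 * np.pi * var), axis=1))
    log_p += np.log(w)
    log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)
    r = np.exp(log_p)                      # responsibilities
    # M-step: re-estimate weights, means and variances.
    nk = r.sum(axis=0) + 1e-10
    w = nk / nk.sum()
    mu = (r.T @ X) / nk[:, None]
    var = (r.T @ X ** 2) / nk[:, None] - mu ** 2
    return w, mu, np.maximum(var, var_floor)

def split(w, mu, var, idx):
    """Split the Gaussians in idx: means +/- 0.2 sigma, same variance, half weight."""
    keep = np.setdiff1d(np.arange(len(w)), idx)
    sigma = np.sqrt(var[idx])
    new_w = np.concatenate([w[keep], w[idx] / 2, w[idx] / 2])
    new_mu = np.concatenate([mu[keep], mu[idx] + 0.2 * sigma, mu[idx] - 0.2 * sigma])
    new_var = np.concatenate([var[keep], var[idx], var[idx]])
    return new_w, new_mu, new_var

def split_init(X, M):
    """Grow a GMM with M Gaussians from a single Gaussian by splitting."""
    # Step 1: single Gaussian from the global statistics
    # (assumption: unscaled data variance).
    w = np.array([1.0])
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    # Step 2: split every Gaussian while doubling does not exceed M.
    while 2 * len(w) <= M:
        w, mu, var = split(w, mu, var, np.arange(len(w)))
        w, mu, var = em_step(X, w, mu, var)
    # Step 3: split only the highest-weight Gaussian until M is reached.
    while len(w) < M:
        w, mu, var = split(w, mu, var, np.array([np.argmax(w)]))
        w, mu, var = em_step(X, w, mu, var)
    return w, mu, var
```

For $ M=5$, for example, this sketch grows the model as 1, 2, 4 Gaussians in the doubling phase and then performs one single-Gaussian split to reach 5.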

Once the initial number of clusters $ K_{init}$ is defined, the speaker clusters themselves need to be initialized. As explained for the broadcast news system, this was done by evenly assigning the available data to the different clusters and performing several segmentation-training iterations to allow homogeneous data to cluster together. While this mechanism is very simple and gives surprisingly good results, it does not ensure that the final clusters contain only data from one speaker (i.e. that they have a high purity).
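The even assignment of data to clusters can be illustrated with a short sketch. It assumes that "evenly" means cutting the sequence of speech frames into $ K_{init}$ contiguous chunks of nearly equal length; the function name `linear_init` is hypothetical, and the subsequent segmentation-training iterations are not shown.

```python
import numpy as np

def linear_init(num_frames, k_init):
    """Assign each frame index to one of k_init clusters in contiguous, even chunks."""
    labels = np.empty(num_frames, dtype=int)
    # np.array_split allows chunk sizes to differ by at most one frame.
    for k, chunk in enumerate(np.array_split(np.arange(num_frames), k_init)):
        labels[chunk] = k
    return labels
```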

To improve on the linear initialization technique, several alternative methods were tested, including K-means at the segment level and E-HMM top-down clustering (Meignier et al., 2001), among others. This finally led to the design of a new algorithm, called the friends-and-enemies initialization, which is further explained in section 4.2.1.
