Clusters Comparison, Pruning and Clusters Merging

Given M clusters with their corresponding models, the matrix of distances between every cluster pair is created and the closest pair is merged if it is determined that both clusters contain data from the same speaker. In order to obtain a measure of similarity between two clusters modeled by a GMM a modified version of the Bayesian Information Criterion (BIC) is used, as introduced by Ajmera in Ajmera, McCowan and Bourlard (2004) and Ajmera and Wooters (2003). As explained in the state of the art chapter, the BIC value quantifies the appropriateness of a model given the data. It is a likelihood-based metric that introduces a penalty term (in the standard formulation) to penalize models by their complexity.

Given two clusters $ C_{i}$ and $ C_{j}$ composed of $ N_{i}$ and $ N_{j}$ acoustic frames respectively, they are modeled with two GMM models $ \mathcal{M}_{i}$ and $ \mathcal{M}_{j}$. The data used to train each of the models is the union of the data belonging to each one of the segments labelled as belonging to cluster $ C_{.}$. In the same manner a third model $ \mathcal{M}_{i+j}$ is trained with $ C_{i+j} = C_{i} \bigcup C_{j}$.

In the standard BIC implementation to compare two clusters (Shaobing Chen and Gopalakrishnan, 1998) the penalty term adds a factor $ \lambda$ that is used to determine the effect of the penalty on the likelihood. The equation of the standard $ \Delta$BIC for GMM models is

$\displaystyle \Delta BIC(C_{i}, C_{j})$   $\displaystyle = \mathcal{L}(C_{i+j}
\vert\mathcal{M}_{i+j})) - (\mathcal{L}(C_{i}\vert\mathcal{M}_{i})) +
    $\displaystyle - \frac{1}{2} \lambda (\char93  M_{i+j} - \char93  M_{i} - \char93  M_{j}) \log
(N_{i}+N_{j})$ (3.1)

where $ \char93  M_{.}$ is the number of free parameters to be estimated for each of the models, i.e. relative to the topology and complexity of the model.

It is considered that two clusters belong to the same speaker if they have a positive $ \Delta$BIC value. Such value is affected by the penalty term (including the $ \alpha$, which acts as a threshold determining which clusters to merge and which not to). The penalty term also modifies the order in which the cluster pairs are merged in an agglomerative clustering system, as each pair will have a different number of total frames and/or models complexities, which will cause the penalty term to be different. In systems based on $ \Delta$BIC this penalty term ($ \lambda$) is usually tuned based on development data and it always takes values greater than 0. In some cases two different values are defined, one for the merging criterion and the other one for the stopping criterion.

When training models to be used in $ \Delta$BIC, these need to be trained using an ML approach, but there is no constraint in the kind of model to use. Ajmera's modification to the traditional BIC formula comes with the inclusion of a constraint to the combined model $ M_{i+j}$:

$\displaystyle \char93  M_{i+j} = \char93  M_{i} + \char93  M_{j}$ (3.2)

This is an easy to follow rule as model $ M_{i+j}$ is normally built exclusively for the comparison.

By Applying this rule to the $ \Delta$BIC formula one avoids having to decide on a particular $ \lambda$ parameter and therefore the real threshold which is applied to consider if two clusters are from the same speaker becomes 0. The formula of the modified-BIC becomes equivalent to the GLR, but with the condition that 3.2 applies.

The lack of an extra tuning parameter makes the system more robust to changes in the data to be processed, although, as the BIC formula is just an approximation of the Bayesian Factor (BF) formulation, sometimes the robustness increase comes with a small detriment on performance.

In the broadcast news system that has been described, the model $ M_{i+j}$ is generated directly from the two individual models $ i$ and $ j$ by pooling all the Gaussian mixtures together. Then the data belonging to both parent models is used to train the new model via ML. Training is always performed using an Expectation Maximization (EM-ML) algorithm and performing 5 iterations on the data.

Once the $ \Delta$BIC metric between all possible cluster pairs has been computed, it searches for the biggest value and if $ \Delta
BIC_{max}(C_{i}, C_{j}) > 0$ the two clusters are merged into a single cluster. In this case, the merged cluster is created in the same way as $ M_{i+j}$ is created, although it is not a requirement. The total complexity of the system remains intact through the merge as the number of Gaussian mixtures representing the data is the same, clustered in M-1 clusters.

The computation of $ \Delta$BIC for all possible combinations of clusters is by far the most computationally intensive step in the agglomerative clustering system. Given that the models are retrained and the data is resegmented after each merge it obtains models at each iteration that are quite dissimilar to the models in the previous iteration, therefore it is recommended to compute all values again. Some techniques were tested to speedup this process by looking at the behavior of the $ \Delta$BIC values. Some are:

user 2008-12-08