Frame-Based Cluster Purification Metrics

To detect and filter out non-speech frames by exploiting the likelihood property observed for non-speech data, two variants of a likelihood-based metric are proposed.

$\displaystyle \bar{\mathcal{L}}(x[i] \vert \Theta_{A}) = \frac{1}{Q} \sum_{j=-... ...-1} \sum_{m=1}^{\widetilde{M}} \log\Big(W_{A}[m] \, \mathcal{N}_{A,m}(x[i+j])\Big)$

(4.15)

Both metrics are based on Equation 4.15, where $Q$ defines the length of the averaging window, used to smooth the measure around the frame of interest and so avoid noisy values; $\widetilde{M}$ is the number of Gaussian mixtures used to compute the likelihood (where $\widetilde{M} \leq M$ , the number of mixtures in the model); $W_{A}[m]$ is the mixture weight and $\mathcal{N}_{A,m}(x[\cdot])$ is the result of evaluating $x[\cdot]$ on the Gaussian mixture $\mathcal{N}_{A,m}$ :

Metric 1: A standard smoothed likelihood over 100ms of data ( $Q=10$ with 10ms acoustic frames) around each acoustic frame, with $\widetilde{M} = M$ (all mixtures in model $\Theta_{A}$ ).
Metric 2: The same smoothed likelihood (over 100ms), computed with a model formed by a subset of the Gaussian mixtures in the speaker model, chosen so that it includes the mixtures assigned to non-speech. The mixtures are selected by summing the variance over all dimensions and keeping those with the smallest accumulated variance, so that $\widetilde{M}=M_{non-speech}$ . This second metric is equivalent to Metric 1 when 100% of the Gaussian mixtures are selected.
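
The two metrics can be illustrated with a minimal Python sketch. It is not the implementation used in this work, and several points are assumptions made only for illustration: the Gaussians have diagonal covariances, the averaging window starts Q/2 frames before the current frame, the window is truncated at the signal edges and divided by the number of frames actually available, and the subset size is given as a fraction of the mixtures. All function and parameter names (`log_gauss_diag`, `select_low_variance`, `frame_scores`, `smooth`, `fraction`) are hypothetical. The inner sum follows Equation 4.15 as printed, that is, a sum over mixtures of the log of the weighted Gaussian.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v
        for xi, mu, v in zip(x, mean, var)
    )


def select_low_variance(variances, fraction):
    """Indices of the mixtures with the smallest variance summed over dimensions."""
    count = max(1, round(fraction * len(variances)))
    order = sorted(range(len(variances)), key=lambda m: sum(variances[m]))
    return sorted(order[:count])


def frame_scores(frames, weights, means, variances, subset):
    """Inner sum of Eq. 4.15 per frame: sum over selected m of log(W[m] * N_m(x))."""
    return [
        sum(
            math.log(weights[m]) + log_gauss_diag(x, means[m], variances[m])
            for m in subset
        )
        for x in frames
    ]


def smooth(scores, q):
    """Average each score over q frames, starting q // 2 frames before it.

    Assumption: at the signal edges the window is truncated and the sum is
    divided by the number of frames actually used.
    """
    out = []
    for i in range(len(scores)):
        lo = max(0, i - q // 2)
        hi = min(len(scores), i - q // 2 + q)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


# Toy model with three 2-dimensional mixtures (made-up values).
weights = [0.5, 0.3, 0.2]
means = [[0.0, 0.0], [3.0, 3.0], [-2.0, 1.0]]
variances = [[1.0, 1.0], [0.1, 0.2], [2.0, 2.0]]
frames = [[0.1 * t, -0.1 * t] for t in range(30)]

all_mixtures = select_low_variance(variances, 1.0)      # Metric 1: every mixture
narrow_mixtures = select_low_variance(variances, 1 / 3)  # Metric 2: low-variance subset

q = 10  # 100ms window with 10ms frames
metric1 = smooth(frame_scores(frames, weights, means, variances, all_mixtures), q)
metric2 = smooth(frame_scores(frames, weights, means, variances, narrow_mixtures), q)
```

With `fraction` set to 1.0 the selected subset contains every mixture, so the second metric reduces to the first, as stated above. How the subset size is actually chosen is not specified here; the `fraction` argument only stands in for that choice.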
