Speech and Non-Speech Modeling

Figure 4.9: Speech-silence histogram for a full meeting
\begin{figure}
\centerline{\epsfig{figure=figures/pur_histogram,width=65mm,
angle=-90}}
\end{figure}

In order to see the effect of typical acoustic speaker models with non-speech data an experiment was performed on all the data belonging to an ICSI meeting used in the RT04s evaluation. All the acoustic frames $ X$ from that meetings were split into speech frames $ X_{0}$ and non-speech frames $ X_{1}$ according to the reference segmentation file provided by NIST. A speaker model with 5 Gaussian Mixtures was trained using only the speech-labelled frames $ X_{1}$. Then both speech and non-speech frames were evaluated using such model and two normalized histograms were created from the resulting likelihood scores, as can be seen in Figure 4.9.

The scores of the non-speech frames $ X_{0}$ are mainly located in the higher part of the histogram, indicating that $ X_{0}$ usually obtains higher likelihood scores than $ X_{1}$ even when evaluating it on a model trained only with $ X_{1}$ data. Part of the $ X_{1}$ frames are also in the upper part of the histogram, which are most probably non-speech frames that are labelled as speech in the reference file. Even with the use of a speech/non-speech detector, a residual error of around 5% of non-speech data enters the clustering system. In order to purify a cluster both the non-speech (undetected) data and the speech-labelled non-speech data needs to be eliminated while maintaining the rest of acoustic frames that discriminate between speakers. It is clear that likelihood can be used to detect and filter out these frames.

Figure 4.10: Observed assignment of frames to Gaussian mixtures
\begin{figure}
\centerline{\epsfig{figure=figures/pur_gmm_sil,width=100mm}}
\end{figure}

A possible explanation for this behavior is illustrated in Figure 4.10 where a cluster model $ \Theta_{A}$, using M Gaussian mixtures, is trained using acoustic data $ X_{1}$ labelled as speech by the speech/non-speech detector. After training the model, a group of Gaussian mixtures $ M_{1}$ adapt their mean and variances to model the subset of the speaker data $ X_{1,1}$ , while another group of Gaussians $ M_{2}$ appears to model the subset of data $ X_{1,2}$ which are nons-speech frames remaining in $ X_{1}$. Since the number of frames in $ X_{1,1}$ is typically much larger than those of $ X_{1,2}$, the number of Gaussian mixtures ssociated to each subgroup are $ \vert M_{1}\vert >> \vert M_{2}\vert$ and, at times, $ \vert M_{2}\vert$ could be 0 if the non-speech data is minimal. Furthermore, the variance of the non-speech Gaussian mixtures in $ M_{2}$ is always much smaller than $ M_{1}$. This is the reason why any non-speech frame evaluated by the model gets a higher score than a speech frame. This is taken advantage of in the frame level purification algorithm.

Figure 4.11: Evaluation of metric 1 on two clusters given their models
\begin{figure}
\centerline{\epsfig{figure=figures/pur_cross_lkld,width=80mm,
angle=-90}}
\end{figure}

To further prove that the acoustic frames with a higher likelihood are those which are less suitable to discriminate between speaker models another experiment was performed taking two speaker clusters trained with acoustic data for two different speakers according to the reference segmentation. Figure 4.11 illustrates the relationship between the likelihood scores of the data used in training each of the two models and evaluated on both models. It is possible to determine an axis between the likelihood values of the two models. The distance to this axis indicates the discriminative power of the data from each cluster. Frames from both clusters with the highest likelihood values are grouped together on this axis, indicating how badly they can differentiate between speakers.

user 2008-12-08