In order to see the effect of typical acoustic speaker models with
non-speech data an experiment was performed on all the data
belonging to an ICSI meeting used in the RT04s evaluation. All the
acoustic frames 
 from that meetings were split into speech
frames 
 and non-speech frames 
 according to the
reference segmentation file provided by NIST. A speaker model with
5 Gaussian Mixtures was trained using only the speech-labelled
frames 
. Then both speech and non-speech frames were
evaluated using such model and two normalized histograms were
created from the resulting likelihood scores, as can be seen in
Figure 4.9.
The scores of the non-speech frames 
 are mainly located in
the higher part of the histogram, indicating that 
 usually
obtains higher likelihood scores than 
 even when evaluating
it on a model trained only with 
 data. Part of the 
frames are also in the upper part of the histogram, which are most
probably non-speech frames that are labelled as speech in the
reference file. Even with the use of a speech/non-speech detector,
a residual error of around 5% of non-speech data enters the
clustering system. In order to purify a cluster both the
non-speech (undetected) data and the speech-labelled non-speech
data needs to be eliminated while maintaining the rest of acoustic
frames that discriminate between speakers. It is clear that
likelihood can be used to detect and filter out these frames.
A possible explanation for this behavior is illustrated in
Figure 4.10 where a cluster model 
, using M
Gaussian mixtures, is trained using acoustic data 
 labelled
as speech by the speech/non-speech detector. After training the
model, a group of Gaussian mixtures 
 adapt their mean and
variances to model the subset of the speaker data 
 ,
while another group of Gaussians 
 appears to model the
subset of data 
 which are nons-speech frames remaining in
. Since the number of frames in 
 is typically much
larger than those of 
, the number of Gaussian mixtures
ssociated to each subgroup are 
 and, at times,
 could be 0 if the non-speech data is minimal.
Furthermore, the variance of the non-speech Gaussian mixtures in
 is always much smaller than 
. This is the reason
why any non-speech frame evaluated by the model gets a higher
score than a speech frame. This is taken advantage of in the frame
level purification algorithm.
To further prove that the acoustic frames with a higher likelihood are those which are less suitable to discriminate between speaker models another experiment was performed taking two speaker clusters trained with acoustic data for two different speakers according to the reference segmentation. Figure 4.11 illustrates the relationship between the likelihood scores of the data used in training each of the two models and evaluated on both models. It is possible to determine an axis between the likelihood values of the two models. The distance to this axis indicates the discriminative power of the data from each cluster. Frames from both clusters with the highest likelihood values are grouped together on this axis, indicating how badly they can differentiate between speakers.
user 2008-12-08