Speech/Non-Speech Detection Block

Experiments for the speech/non-speech module were conducted for the SDM case so that the results are directly comparable with the baseline system results shown in the previous section, although in this case slightly different development and test sets were used. The development set consisted of the RT02 + RT04s datasets (16 meeting excerpts) and the test set was the RT05s set (with the exception of the NIST meeting with faulty transcriptions). Forced alignments were used as the reference to evaluate the DER, MISS and FA errors.
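As an illustration of the two error types, the following minimal sketch computes MISS and FA from frame-level labels. It assumes that both the reference and the hypothesis are equal-length sequences of 0/1 labels (1 = speech) and that both rates are normalized by the amount of reference speech, as in NIST DER scoring; the function name and the toy labels are hypothetical and not part of the evaluation tools used here.

```python
def miss_fa_rates(reference, hypothesis):
    """Return (MISS %, FA %) for two equal-length 0/1 label sequences (1 = speech).

    Assumption: both rates are normalized by the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    ref_speech = sum(reference)
    if ref_speech == 0:
        raise ValueError("reference contains no speech")
    # MISS: reference speech labelled as non-speech by the detector.
    miss = sum(1 for r, h in zip(reference, hypothesis) if r == 1 and h == 0)
    # FA: reference non-speech labelled as speech by the detector.
    fa = sum(1 for r, h in zip(reference, hypothesis) if r == 0 and h == 1)
    return 100.0 * miss / ref_speech, 100.0 * fa / ref_speech


# Toy example: a detector that labels everything as speech never misses
# speech, and all reference non-speech becomes false alarm.
reference = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
all_speech = [1] * len(reference)
print(miss_fa_rates(reference, all_speech))
```

In this toy case the all-speech hypothesis gives MISS = 0% and FA = 25% (2 non-speech frames over 8 speech frames).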

In the development of the proposed hybrid speech/non-speech detector there are three main parameters that need to be set. These are the minimum duration for the speech/non-speech segments in both the energy block and the models block, and the complexity of the models in the models block.
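To make the minimum-duration parameter concrete, the sketch below applies such a constraint as a simple post-processing step on a sequence of frame labels: any run shorter than the minimum is absorbed into the run that precedes it. This is only an illustration under stated assumptions; the detector described here may enforce the constraint differently (for example inside the decoder), and the function name and toy labels are hypothetical.

```python
from itertools import groupby


def enforce_min_duration(labels, min_len):
    """Merge runs shorter than min_len frames into the preceding run.

    Simplifying assumption: a short run at the very start has no predecessor
    and is kept unchanged.
    """
    merged = []  # list of [label, run_length]
    for label, group in groupby(labels):
        length = len(list(group))
        if merged and length < min_len:
            # Too short: relabel as the previous run by extending it.
            merged[-1][1] += length
        elif merged and merged[-1][0] == label:
            # Same label as the (extended) previous run: join them.
            merged[-1][1] += length
        else:
            merged.append([label, length])
    smoothed = []
    for label, length in merged:
        smoothed.extend([label] * length)
    return smoothed


# A one-frame non-speech gap is removed when the minimum duration is 3 frames.
labels = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0]
print(enforce_min_duration(labels, 3))
```

A larger minimum duration removes more short segments, which is why it trades off MISS against FA.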

**Figure 6.1:** *Energy-based system errors depending on its segment minimum duration*

The development set was first used to estimate the minimum duration of the speech and non-speech segments in the energy-based detector. Figure 6.1 shows the MISS and FA scores for various durations (in number of frames). While for a final speech/non-speech system one would choose the value that gives the minimum total error, here the goal is to obtain enough non-speech data to train the non-speech models in the second step. It is therefore important to choose the value with the smallest MISS, so that the non-speech model is as pure as possible. The speech model is usually assigned more Gaussian mixtures in the modeling step, so a higher FA rate does not affect it as much. The MISS rate remains quite flat for durations between 1000 and 8000, which indicates that the system is robust to variations in the data. Even if the MISS rate on a new dataset does not reach its minimum at the same value as in the development set, the chosen value will most probably still be a plausible solution. A duration of 2400 (150 ms) is chosen, with MISS = 0.3% and FA = 9.5% (total 9.7%).
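The durations quoted in frames can be related to time with a short conversion. The sketch below assumes that each unit corresponds to one sample at 16 kHz; this rate is not stated explicitly but is inferred from the equivalences given in the text (2400 for 150 ms, 8000 for 0.5 seconds). The names used are hypothetical.

```python
# Assumption: one duration unit = one sample at 16 kHz, inferred from the
# equivalences 2400 -> 150 ms and 8000 -> 0.5 s given in the text.
SAMPLE_RATE_HZ = 16000


def units_to_seconds(n_units, sample_rate=SAMPLE_RATE_HZ):
    """Convert a duration expressed in units (samples) to seconds."""
    return n_units / sample_rate


# Boundaries of the flat MISS region and the chosen value.
for n_units in (1000, 2400, 8000):
    print(f"{n_units} units = {units_to_seconds(n_units) * 1000:.1f} ms")
```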

**Figure 6.2:** *Model-based system errors depending on its segment minimum duration*

The same procedure is followed to select the minimum duration for the speech and non-speech segments decoded using the model-based decoder, keeping the minimum duration determined by the previous analysis of the energy-based detector. Figure 6.2 shows the FA and MISS error rates for different minimum segment sizes (the same for speech and non-speech). Because the curve is almost identical when using different numbers of mixtures for the speech model, a complexity of 2 Gaussian mixtures for the speech model and 1 for silence is chosen. In contrast to the energy-based system, this second step outputs the final result used by the diarization system, so it is necessary to find the minimum segment duration that minimizes the total percent error. A minimum error of 5.6% was achieved using a minimum duration of 0.7 seconds. If the energy-based detector parameters that minimize the overall speech/non-speech error (8000 frames, 0.5 seconds) had been chosen instead of the current ones, the minimum error after the cluster-based decoder step would have been 6.0%.

**Table 6.3:** *Speech/non-speech errors on development and test data*

| sp/nsp system | RT02+RT04s MISS | RT02+RT04s FA | RT02+RT04s total | RT05s MISS | RT05s FA | RT05s total |
|---|---|---|---|---|---|---|
| All-speech system | 0.0% | 11.4% | 11.4% | 0.0% | 13.2% | 13.2% |
| Pre-trained models | 1.9% | 3.2% | 5.1% | 1.9% | 4.6% | 6.5% |
| Hybrid (1st part) | 0.4% | 9.7% | 10.1% | 0.1% | 10.4% | 10.5% |
| Hybrid system (all) | 2.4% | 3.2% | 5.6% | 2.8% | 2.1% | 4.9% |

Table 6.3 presents results for the development and evaluation sets using the selected parameters, taking into account only the MISS and FA errors from the proposed module. For comparison, the "all-speech" system shows the total percentage of data labelled as non-speech in the reference (ground truth) files. The forced alignment obtained from the STT system contained many non-speech segments of very short duration, due to the strict application of the 0.3 s minimum pause duration rule to the forced alignment segmentations. The second row shows the speech/non-speech results using the SRI speech/non-speech system (Stolcke et al., 2005), which was developed using training data from various meeting sources, with its parameters optimized using the development data presented here and the forced alignment reference files. If tuned using the hand-annotated reference files provided by NIST for each data set, it obtains a much higher FA rate, possibly because it is harder to follow the 0.3 s silence rule in hand-annotated data. The third and fourth rows correspond to the presented algorithm. The third row shows the errors at the intermediate stage of the algorithm, after the energy-based decoding. These are not comparable with the other systems, as the optimization here targets the MISS error and not the total error. The fourth row shows the final output of both stages together.

Although the speech/non-speech error rate obtained on the development set is worse than that obtained with the pre-trained system, it is almost 25% relative better on the evaluation set. This changes when considering the final DER. To test the usability of such speech/non-speech output for the speaker diarization of meeting data, the baseline system was run with each of the three speech/non-speech modules shown in table 6.3 interposed.
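For reference, the relative figure on the evaluation set follows from the RT05s total errors in table 6.3 (6.5% for the pre-trained models and 4.9% for the full hybrid system), taking the pre-trained system as the baseline:

```latex
\frac{6.5 - 4.9}{6.5} \times 100\% \approx 24.6\%
```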

**Table 6.4:** *DER using different speech/non-speech systems*

| sp/nsp system | Development | Evaluation |
|---|---|---|
| All-speech | 27.50% | 25.17% |
| Pre-trained models | 19.24% | 15.53% |
| Hybrid system | 16.51% | 13.97% |

Table 6.4 shows that the use of any speech/non-speech detection algorithm improves the performance of the speaker diarization system. Both systems perform much better than the diarization system used alone. This is due to the agglomerative clustering technique, which starts with a large number of speaker clusters and tries to converge to an optimum number of clusters via cluster-pair comparisons. As non-speech data is distributed among all clusters, the more non-speech they contain, the less discriminative the comparison is, leading to more errors.

On both the development and evaluation sets, the final DER of the proposed speech/non-speech system outperforms the system using pre-trained models, by 14% relative (development) and 10% relative (evaluation). The DER on the development set is much better than that of the pre-trained system, even though the proposed system has a worse speech/non-speech error. This indicates that the proposed system obtains a set of speech/non-speech segments that are more tightly coupled with the diarization system.
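The relative improvements quoted above can be reproduced from the DER values in table 6.4. The following is a minimal sketch; the function and variable names are hypothetical, and only the DER values come from the table.

```python
def relative_improvement(baseline, proposed):
    """Relative reduction (in %) of an error rate with respect to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


# DER values from table 6.4: (pre-trained models, hybrid system).
der = {
    "development": (19.24, 16.51),
    "evaluation": (15.53, 13.97),
}

for dataset, (pretrained, hybrid) in der.items():
    gain = relative_improvement(pretrained, hybrid)
    print(f"{dataset}: {gain:.1f}% relative DER improvement")
```

This gives roughly 14.2% on the development set and 10.0% on the evaluation set, consistent with the rounded figures in the text.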