RT06s Official Performance Scores

This section presents the official NIST scores for all of the ICSI systems submitted to the RT06s evaluation in the speaker diarization (SPKR) task and the speech activity detection (SAD) task. In RT06s the main metric was the DER, including the speaker overlap regions. Tables 7.2 and 7.3 show the SPKR results for conference and lecture room data respectively, and Table 7.4 shows the SAD results. During the development of the systems for RT06s, the focus shifted to using forced-alignments as reference segmentations instead of hand-alignments, which were believed to be less reliable. In all cases the results tables show scores against both the official hand-made references and the forced-alignment references.
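
As a reminder of how the metric is put together, the following minimal sketch computes a DER from its three error components, following the usual NIST definition (missed speaker time, false alarm speaker time and speaker error time, divided by the total scored speaker time, with overlap regions counted once per reference speaker). The function name and the example durations are hypothetical and are not taken from the evaluation.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, scored_speaker_time):
    """Return the DER in percent from error durations given in seconds.

    scored_speaker_time is the total reference speaker time; in overlap
    regions each active reference speaker contributes separately.
    """
    if scored_speaker_time <= 0:
        raise ValueError("scored_speaker_time must be positive")
    return 100.0 * (missed + false_alarm + speaker_error) / scored_speaker_time


# Hypothetical durations (seconds), for illustration only.
example_der = diarization_error_rate(
    missed=30.0, false_alarm=12.0, speaker_error=48.0, scored_speaker_time=600.0
)
print(f"DER = {example_der:.2f}%")
```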

In general, conference room results for RT06s scored against hand-alignments were much worse than in previous years; the degradation was less pronounced when scoring against the forced-alignments. This might be due to the increased complexity of the data and to a decrease in the quality of the hand-generated transcriptions for the RT06s evaluation.

Table 7.2: Results for RT06s Speaker Diarization, conference room environment

Cond.	System ID	%DER MAN	%DER FA
MDM	p-wdels	35.77%	19.16%
	c-newspnspdelay	35.77%	20.03%
	c-wdelsfix	38.26%	23.32%
	c-nodels	41.93%	27.46%
	c-oldbase	42.36%	27.01%
SDM	p-nodels	43.59%	28.25%
	c-oldbase	43.93%	28.21%

Table 7.3: Results for RT06s Speaker Diarization, lecture room environment

Cond.	System ID	%DER MAN	%DER MAN (subset)	%DER FA (subset)
ADM	p-wdels	12.36%	11.54%	10.56%
	c-nodels	10.43%	10.60%	9.71%
	c-wdelsfix	11.96%	12.73%	11.58%
	c-guessone	25.96%	23.36%	24.51%
MDM	p-wdels	13.71%	11.63%	10.97%
	c-nodels	12.97%	13.80%	13.09%
	c-wdelsfix	12.75%	12.95%	12.34%
	c-guessone	25.96%	23.36%	24.51%
SDM	p-nodels	13.06%	12.47%	11.69%
	c-guessone	25.96%	23.36%	24.51%
MSLA	p-guessone	25.96%	23.36%	24.51%

In the SPKR task for conference room data, the first three MDM systems in Table 7.2 show a substantial improvement over the last two, which is due to using delays as features in diarization. In lecture room data (Table 7.3, third column) the use of delays degrades performance, possibly because people move around the room (the delays treat each location as a different speaker).
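
To put a number on the conference room gain, the sketch below computes the relative DER reduction between two MDM rows of Table 7.2. Taking c-nodels as the no-delay counterpart of p-wdels is an assumption based on the system names; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent when going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# (%DER MAN, %DER FA) for two MDM systems, copied from Table 7.2.
table_7_2_mdm = {
    "p-wdels": (35.77, 19.16),   # with delays
    "c-nodels": (41.93, 27.46),  # assumed no-delay counterpart
}

with_delays = table_7_2_mdm["p-wdels"]
without_delays = table_7_2_mdm["c-nodels"]

man_gain = relative_reduction(without_delays[0], with_delays[0])
fa_gain = relative_reduction(without_delays[1], with_delays[1])
print(f"relative DER reduction: MAN {man_gain:.1f}%, FA {fa_gain:.1f}%")
```

Under that pairing, the reduction is about 15% relative against the hand-alignments and about 30% relative against the forced-alignments.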

Figure 7.3 shows the DER per meeting for each of the presented systems. It is interesting to observe that the primary MDM system (mdm_p-wdels) obtains flatter scores across all the shows than last year's system, labelled mdm_c-newspnspdelay. Both are shown as dashed lines in Figure 7.3.

**Figure 7.3:** *DER break-down by show for the RT06s conference data*

In general, the more microphones available for processing, the better the results. Since the diarization system is the same in all conditions, the improvement comes from the filter&sum processing. This is clear in the conference room data, while results are mixed in the lecture room data. It is believed that this is due to the large difference in quality between the microphone used in SDM and all the others.
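
The idea behind filter&sum can be illustrated with a plain delay-and-sum sketch: each channel is shifted by its delay relative to a reference channel and the aligned channels are averaged. This is a simplification under stated assumptions (integer sample delays that are already known, uniform weights by default); it is not the module used in the submitted systems, and the name `delay_and_sum` is hypothetical.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each channel by its integer sample delay and sum the result.

    channels: list of equal-length sample lists, one per microphone.
    delays:   delays[i] is how many samples channel i lags the reference.
    weights:  optional per-channel weights (uniform if omitted).
    """
    n_channels = len(channels)
    if n_channels == 0 or len(delays) != n_channels:
        raise ValueError("need one delay per channel")
    if weights is None:
        weights = [1.0 / n_channels] * n_channels
    length = len(channels[0])
    output = [0.0] * length
    for samples, delay, weight in zip(channels, delays, weights):
        for t in range(length):
            source = t + delay  # read ahead to undo the lag
            if 0 <= source < length:
                output[t] += weight * samples[source]
    return output


# Toy example: the same pulse reaches the second microphone 2 samples later.
mic_ref = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic_far = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
aligned = delay_and_sum([mic_ref, mic_far], delays=[0, 2])
print(aligned)
```

Once aligned, the pulse adds coherently in a single sample, whereas uncorrelated noise on the two channels would be averaged down.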

In the lecture room results shown in Table 7.3, the manual and forced-alignment DER are compared for all submitted systems. The third column shows the results using the latest release of the manual reference segmentations (18 meeting segments). The forced-alignments were generated from the IHM channel of each individual speaker, so they could not be produced for the meeting segments containing speakers who did not wear a headset microphone. The last column shows results using forced-alignment references for a subset of 17 meeting segments in which all speakers wore a headset microphone. The second-to-last column shows results on this same subset using hand-alignments, for comparison purposes.

Results using FA references are much better than those using hand-alignments in the conference room, while they remain similar in the lecture room (with a consistent improvement of roughly 0.5% to 1% absolute for FA). It is believed that the conference room manual references contain many human-introduced errors, which were filtered out of the lecture room references over several redistributions of the references.
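
The size of the lecture room difference can be checked directly from the two subset columns of Table 7.3. The sketch below subtracts the FA score from the hand-alignment score for each system; the guessone rows are left out on the assumption that they are a single-speaker baseline rather than a diarization system, and the variable names are hypothetical.

```python
# (%DER MAN (subset), %DER FA (subset)) per system, copied from Table 7.3.
table_7_3_subset = {
    ("ADM", "p-wdels"): (11.54, 10.56),
    ("ADM", "c-nodels"): (10.60, 9.71),
    ("ADM", "c-wdelsfix"): (12.73, 11.58),
    ("MDM", "p-wdels"): (11.63, 10.97),
    ("MDM", "c-nodels"): (13.80, 13.09),
    ("MDM", "c-wdelsfix"): (12.95, 12.34),
    ("SDM", "p-nodels"): (12.47, 11.69),
}

# Absolute DER improvement when scoring against FA instead of MAN references.
fa_gains = {
    system: round(man - fa, 2) for system, (man, fa) in table_7_3_subset.items()
}

for (condition, system_id), gain in fa_gains.items():
    print(f"{condition} {system_id}: {gain:.2f} absolute")

smallest_gain = min(fa_gains.values())
largest_gain = max(fa_gains.values())
```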

Figure 7.4 shows the break-down of the DER for all presented systems for the lecture room data. Some meetings are much harder to process than others, creating spikes in the DER curves that are more or less pronounced depending on the system. In some cases the ADM systems perform as well in these "hard" meetings as in the easier ones.

**Figure 7.4:** *DER break-down by show for the RT06s lecture data*

Table 7.4 shows the results for the SAD task on conference and lecture room data, using the new speech/non-speech detector developed for the RT06s evaluation.

Table 7.4: Results for RT06s Speech Activity Detection (SAD). Results with * are only for a subset of segments

Env.	Cond.	%DER MAN (%MISS, %FA)	%DER MAN, subset (%MISS, %FA)	%DER FA (%MISS, %FA)
Conference	MDM	23.51 (22.76, 0.8)	-	11.10 (7.80, 3.30)
	SDM	24.95 (24.24, 0.8)	-	11.50 (8.80, 2.70)
Lecture	ADM	13.22 (9.3, 3.9)	7.9* (5.0, 2.9)	7.2* (3.7, 3.5)
	MDM	13.83 (9.3, 4.5)	6.5* (5.0, 1.5)	5.6* (3.6, 2.0)
	SDM	14.59 (10.0, 4.6)	7.2* (4.5, 2.7)	6.7* (3.3, 3.4)

The RT06s speech/non-speech detector was developed using forced-alignment (FA) data, so the SAD results are better when scored against FA references, as the forced-alignment column shows. The increase in %MISS in the hand-aligned conference data compared to the FA results is probably due to silence regions (longer than 0.3s) that are correctly labelled by the FA transcriptions but are considered speech by the hand-alignments.
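
For the SAD task the score reduces to missed speech plus false alarm speech. The following minimal sketch computes both on a frame basis, assuming fixed-length frames and that both rates are normalised by the amount of reference speech; the function name and the toy labels are hypothetical, and the official scoring works on time segments rather than frames.

```python
def sad_error(reference, hypothesis):
    """Return (%MISS, %FA, %total) for frame-level speech/non-speech labels.

    reference, hypothesis: equal-length sequences of booleans (True = speech).
    Both rates are expressed relative to the number of reference speech frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    speech_frames = sum(1 for ref in reference if ref)
    if speech_frames == 0:
        raise ValueError("reference contains no speech")
    missed = sum(1 for ref, hyp in zip(reference, hypothesis) if ref and not hyp)
    false_alarm = sum(1 for ref, hyp in zip(reference, hypothesis) if hyp and not ref)
    miss_rate = 100.0 * missed / speech_frames
    fa_rate = 100.0 * false_alarm / speech_frames
    return miss_rate, fa_rate, miss_rate + fa_rate


# Toy labels: 8 reference speech frames, 2 of them missed, 1 false alarm.
ref_labels = [True, True, True, True, False, False, True, True, True, True]
hyp_labels = [True, True, False, True, True, False, True, True, True, False]
miss_rate, fa_rate, total_rate = sad_error(ref_labels, hyp_labels)
print(f"MISS {miss_rate:.1f}%  FA {fa_rate:.1f}%  total {total_rate:.1f}%")
```

A reference that marks a short pause as speech, as suspected for the hand-alignments, turns correctly detected silence into missed frames in this computation, which raises %MISS without any change in the detector output.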

As was done for the diarization experiments, a subset of meetings was created to evaluate the lecture room systems against forced-alignment references, with the corresponding hand-alignments included for completeness. One initial observation is that the error rate decreases dramatically when evaluating only this subset of the shows using hand-alignments. Possible explanations are transcription errors caused by the lower quality of the non-headset microphones used in the eliminated set of meetings, and/or an overall decrease in quality in these meetings for causes other than the transcription process.

As in the diarization results, these experiments also obtain better results the more microphones are used, thanks to the filter&sum module. When comparing the forced-alignment results with the hand-alignment results on the subset, the forced-alignment results keep a better balance between misses and false alarms, indicating that the parameters defined in development translate robustly to the evaluation data.

Overall, for RT06s the use of delays between microphones as a feature in the diarization process brought a large improvement on conference room data, while results were mixed in the lecture room. In addition, a general improvement was observed when applying filter&sum to as many microphone signals as possible.