Overall Experiments and Analysis of Results

In the previous section most of the algorithms proposed in this thesis for speaker diarization were analyzed, first individually against the baseline and then as part of an agglomerate system, in order to obtain an optimum final system.


Table 6.23: Summary of average DER for the agglomerate system on development and evaluation data
System                         | DER devel | Improv. vs. prior | Improv. vs. baseline | DER eval | Improv. vs. prior | Improv. vs. baseline
-------------------------------|-----------|-------------------|----------------------|----------|-------------------|----------------------
Baseline                       | 18.71%    | -                 | -                    | 23.23%   | -                 | -
Multi-stream weights           | 17.93%    | 4.16%             | 4.16%                | 23.97%   | -3.18%            | -3.18%
# init clusters + complexity   | 17.19%    | 4.12%             | 8.12%                | 23.18%   | 3.29%             | 0.21%
Friends-and-enemies init       | 17.77%    | -3.37%            | -                    | 23.79%   | -2.63%            | -
CV-EM training                 | 17.17%    | 0.11% (*)         | 8.23%                | 21.79%   | 5.99% (*)         | 6.19%
Frame purification             | 16.77%    | 2.32%             | 10.36%               | 20.16%   | 7.48%             | 13.21%
Segment purification           | 16.82%    | -0.29%            | -                    | 20.55%   | -1.93%            | -


Table 6.23 summarizes the results analyzed in the previous section and computes the relative improvement of each algorithm with respect to the previous one and to the baseline (accumulating all improvements). Both the friends-and-enemies initialization and the segment purification algorithms perform poorly in this experiment and are therefore excluded from the agglomerate system result (values marked with * compute the relative improvement over the prior system without taking the excluded algorithms into account). The algorithms not included in this final experiment remain valid and obtain good results in certain situations, but not on average over all cases.
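As a worked check of how the two improvement columns are derived, the following minimal Python sketch (the helper name is illustrative, not part of the diarization system) reproduces two entries of Table 6.23; small deviations, e.g. 4.17% instead of 4.16%, come from the rounding of the printed DER values:

    # Relative DER improvement (%) of a new system with respect to a reference,
    # used for both the "Improv. vs. prior" and "Improv. vs. baseline" columns.
    def rel_improvement(der_ref, der_new):
        return 100.0 * (der_ref - der_new) / der_ref

    # Multi-stream weights on development: 17.93% DER against the 18.71% baseline.
    print(rel_improvement(18.71, 17.93))  # ~4.17 "vs. prior" (4.16 in the table)

    # "# init clusters + complexity" accumulates the gains down to 17.19% DER.
    print(rel_improvement(18.71, 17.19))  # ~8.12 "vs. baseline", as in Table 6.23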

Taking into account the average DER over the SDM, MDM and TDOA-MDM system outputs, and tuning all algorithms to that average, the proposed algorithms in the diarization module achieve relative improvements of up to 10.36% on the development set and up to 13.21% on the evaluation set.

While optimizing the average DER yields systems that perform well across all the tasks considered, in some applications where multiple microphones are available it is of interest to obtain the best possible result. To that end, the TDOA-MDM system was selected and its parameters optimized for it according to the parameter sweeps performed in the previous section.


Table 6.24: Results for the TDOA-MDM task using different algorithm settings
System                    | DER development | DER evaluation
--------------------------|-----------------|---------------
Optimum average system    | 14.92%          | 15.55%
Optimum TDOA-MDM system   | 14.64%          | 14.76%
Best TDOA-MDM devel       | 13.78%          | 17.22%


The obtained results are shown in Table 6.24, where DER is reported only for the TDOA-MDM task, usually the one with the best performance. Using the optimum parameters from the development sweeps in the previous section, and including all retained algorithms in the system, the resulting optimum TDOA-MDM system obtains a 5.08% relative improvement on the evaluation set over the optimum average system, which was tuned on the average DER of all three tasks.

The final system exhibits robust performance under changes in the data, sometimes at the cost of not reaching the absolute minimum DER in every case. To illustrate this, consider the system labelled Best TDOA-MDM devel in Table 6.24, which is built using only the automatic weighting algorithm and the selection of the number of initial clusters and model complexity. This system vastly outperforms the optimum systems on the development set, but when moving to a different set its performance degrades considerably. Once all algorithms are in place, the optimum systems obtain even results across both sets at the expense of a small increase in DER on the development set.


Table 6.25: Overall thesis scores comparison
System                    | DER development | DER evaluation
--------------------------|-----------------|---------------
BN baseline (SDM)         | 24.88%          | 19.80%
Meetings baseline (MDM)   | 19.04%          | 26.50%
Optimum TDOA-MDM system   | 14.64%          | 14.76%


Table 6.25 shows the DER scores that illustrate the overall improvement achieved by the system in its transformation from broadcast news speaker diarization to diarization for meetings. The BN baseline row shows the DER of the described baseline, which uses model-based speech/non-speech detection. As pointed out earlier, even though this system is already a step forward from the system available at the start of this thesis work, it acts as a good baseline for all the work done in beamforming and speaker diarization.

The meetings baseline is the same as the BN baseline, but uses the hybrid speech/non-speech detection and the baseline RT06s beamforming. Finally, the optimum TDOA-MDM system, presented earlier in this section, shows the optimum and robust results obtained by using all proposed algorithms.

The optimum TDOA-MDM system obtains an outstanding 41.15% relative improvement on the development set compared to the BN baseline, and a 25.45% relative improvement on the evaluation set. As shown throughout the experiment sections, these improvements are due to all the new and reworked algorithms proposed for the system, with the MDM beamforming and the inclusion of TDOA features into the diarization being the two most prominent contributors.

One interesting result in the BN baseline system is the striking difference between development and evaluation results. While the system performs rather poorly on the development set, for that particular combination of parameters it obtains a very good result on the evaluation set. On this system, running acoustic beamforming with more microphones than just the SDM always results in an increase in DER on the evaluation set. In fact, an experiment using the RT06s beamforming on top of the BN baseline achieves a 23.63% DER on the development set (a slight improvement) and a 26.3% DER on the evaluation set (much worse, and similar to the result of the meetings baseline).

This is another example of the flakiness and lack of robustness of the baseline system. On one hand, while one data set performs well, another can perform very poorly with the same parameter settings. On the other hand, when making changes to the system, not all datasets behave the same way: achieving an improvement on one set does not mean that it translates to the others. This problem is a keystone of research in speaker diarization and has been a main concern throughout the development of this thesis, as erratic results jeopardize the assessment of new techniques which, although beneficial to the system, might be discarded due to apparently poor performance.

Given each independent recording (meetings, broadcast news or other sources), the speaker diarization algorithm processes it and obtains an output segmentation. Such segmentation might show a slight improvement due to the applied algorithms, but it could also obtain a very high DER due to factors such as a badly chosen stopping point. Since the final DER is the time-weighted aggregate over all excerpts, if a few of them exhibit such bad behavior the final score becomes worse than in previous runs, misleadingly suggesting that the tested algorithm is not correct. When the DER is computed over a small set of excerpts (8 to 10), these errors have a large impact on the final score.
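To make the time weighting explicit, the following minimal sketch (with illustrative numbers, not actual thesis data) aggregates per-excerpt DERs into the overall score and shows how a single badly behaved excerpt dominates a small test set:

    # Overall DER as a time-weighted aggregate over excerpts: total error time
    # divided by total scored time, expressed as a percentage.
    def overall_der(excerpts):
        # excerpts: list of (scored_seconds, der_percent) pairs, one per recording
        total_time = sum(t for t, _ in excerpts)
        error_time = sum(t * der / 100.0 for t, der in excerpts)
        return 100.0 * error_time / total_time

    good = [(600.0, 15.0)] * 9        # nine well-behaved excerpts
    bad = (600.0, 60.0)               # one excerpt with a badly chosen stopping point
    print(overall_der(good))          # 15.0
    print(overall_der(good + [bad]))  # 19.5: one outlier shifts the score by 4.5 points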

While the DER score, as used in the NIST RT evaluations, is the standard way of measuring diarization performance and remains the ultimate metric to reduce, there are several alternatives for avoiding the problems posed above. On one hand, bigger development and evaluation sets could be used to reduce the effect of these outliers; to work with such datasets there must be accurate and consistent transcriptions to test against, which should be obtained through automated mechanisms such as forced alignments. On the other hand, the DER metric could be altered to eliminate outlier scores from the average during development. Although this would handle the occasional excerpts with big errors, it does not help improve those considered ``hard nuts'' (shows that always perform very badly), and it is therefore difficult to define outlier boundaries that describe the system correctly.
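As one possible realization of the second alternative, the sketch below drops excerpts whose DER lies beyond a 1.5 * IQR fence before re-aggregating; the fence is a hypothetical choice, since, as noted above, the thesis does not settle on a correct outlier boundary:

    import statistics

    def trimmed_overall_der(excerpts):
        # Discard excerpts whose DER lies above Q3 + 1.5 * IQR, then re-score
        # with the time-weighted overall_der() from the previous sketch.
        ders = [der for _, der in excerpts]
        q1, _, q3 = statistics.quantiles(ders, n=4)
        fence = q3 + 1.5 * (q3 - q1)
        return overall_der([(t, der) for t, der in excerpts if der <= fence])

    # For the data above: trimmed_overall_der(good + [bad]) -> 15.0. This absorbs
    # occasional blow-ups, but does nothing to improve a "hard nut" show.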


Table 6.26: Overall thesis DER scores comparison
System                    | RT02s  | RT04s  | RT05s  | All dev. | Eval (RT06s)
--------------------------|--------|--------|--------|----------|-------------
BN baseline (SDM)         | 35.20% | 28.26% | 23.26% | 27.82%   | 37.67%
Optimum TDOA-MDM system   | 27.86% | 25.05% | 17.47% | 22.27%   | 31.75%


Finally, for comparison purposes, Table 6.26 shows the initial and final systems evaluated this time using the hand-alignment references provided by NIST during the evaluation campaigns, splitting the results by evaluation source. These are reported for comparison only, as development with these references was abandoned in favor of the forced-alignment references, which are much more consistent across years.
