This chapter covers the main contributions of this thesis in the area of acoustic modeling for speaker diarization in the meeting domain. As pointed out earlier, these algorithms were either designed to improve an existing algorithm in the baseline system or newly created to solve problems detected in the system.
This chapter is structured into three main sections. The first section introduces a new speech/non-speech detector that does not require any training data, while achieving non-speech detection performance similar to that of the previous pre-trained system and better diarization performance.
The second section covers four algorithms used in the definition of the speaker clusters and their models. The first algorithm automatically determines the number of initial clusters with which the agglomerative clustering starts. The second algorithm obtains an initial clustering by classifying the acoustic data into the desired number of initial clusters. On the topic of speaker modeling, the third algorithm determines the complexity of each model in the system given the amount of data available for training it. Finally, a modification to the baseline duration modeling is proposed to remove the artificial constraints previously imposed on the speaker turn duration.
The third section explores the problems derived from clusters
containing data from more than a single speaker. When comparing two
speaker models, an erroneous decision can be made depending on the
amount of such misplaced data. This section presents two
algorithms to purify the clusters and avoid such problems. On the
one hand, the frame-level purification modifies the speaker models
only in the comparison step by filtering out acoustic frames that
might harm the comparison. On the other hand, the segmentation-level
purification detects full segments that do not match the
cluster they belong to and assigns them to a new cluster.