Acoustic Modeling Algorithms for Speaker Diarization in Meetings

This chapter covers the main contributions of this thesis in the area of acoustic modeling for speaker diarization in the meeting domain. As pointed out earlier, these algorithms either improve an existing algorithm in the baseline system or were newly created to solve problems detected in it.

This chapter is structured into three main sections. The first section introduces a new speech/non-speech detector that requires no training data, yet matches the earlier pre-trained system on non-speech detection and yields better diarization performance.

The second section covers four algorithms used to define the speaker clusters and their models. The first algorithm automatically selects the number of initial clusters with which the agglomerative clustering starts. The second algorithm obtains an initial clustering by classifying the acoustic data into the desired number of initial clusters. On the topic of speaker modeling, the third algorithm determines the complexity of each model in the system given the amount of data available to train it. Finally, a modification to the baseline duration modeling is proposed to remove the artificial constraints previously imposed on the speaker turn duration.
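The idea behind the third algorithm can be illustrated with a minimal sketch. It assumes, which this overview does not state, that model complexity means the number of Gaussian mixtures in a cluster model and that this number grows linearly with the number of acoustic frames assigned to the cluster. The function name, the `frames_per_gaussian` parameter, and its default value are hypothetical and chosen only for illustration.

```python
def gaussians_for_cluster(num_frames, frames_per_gaussian=700, min_gaussians=1):
    """Pick a mixture count proportional to the data assigned to a cluster.

    Assumption: one Gaussian per `frames_per_gaussian` frames, never fewer
    than `min_gaussians`. The ratio used here is an illustrative value.
    """
    if num_frames < 0 or frames_per_gaussian <= 0:
        raise ValueError("frame counts must be non-negative and the ratio positive")
    # Integer division keeps the result deterministic for small clusters.
    return max(min_gaussians, num_frames // frames_per_gaussian)


# Clusters with more data get more complex models.
cluster_sizes = {"cluster_a": 7000, "cluster_b": 100}
complexities = {name: gaussians_for_cluster(n) for name, n in cluster_sizes.items()}
```

Tying complexity to the amount of data in this way keeps small clusters from being modeled with more parameters than their data can support.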

The third section explores the problems that arise when clusters contain data from more than a single speaker. When two speaker models are compared, an erroneous decision can be made depending on the amount of such misplaced data. This section presents two algorithms to purify the clusters and avoid such problems. On one hand, the frame-level purification modifies the speaker models only in the comparison step, by filtering out acoustic frames that might harm the comparison. On the other hand, the segmentation-level purification detects full segments that do not match the cluster they belong to and assigns them to a new cluster.
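The reassignment step of the segmentation-level purification can be sketched as follows. This is only an illustration under stated assumptions: each segment is assumed to already have a score measuring how well it matches its own cluster (for example, a length-normalized log-likelihood), and a fixed threshold is assumed as the mismatch criterion. The function name, the score definition, and the threshold are hypothetical; the overview does not specify the actual criterion.

```python
def purify_segments(segment_scores, segment_clusters, threshold):
    """Move poorly matching segments into a single new cluster.

    segment_scores[i]   -- hypothetical match score of segment i against its
                           own cluster (higher means a better match)
    segment_clusters[i] -- integer cluster label of segment i
    threshold           -- assumed cutoff below which a segment is misplaced
    """
    if len(segment_scores) != len(segment_clusters):
        raise ValueError("one score is required per segment")
    if not segment_clusters:
        return []
    new_cluster = max(segment_clusters) + 1  # label not used by any cluster yet
    purified = list(segment_clusters)        # leave the input untouched
    for i, score in enumerate(segment_scores):
        if score < threshold:
            purified[i] = new_cluster
    return purified


labels = purify_segments([-1.0, -5.0, -1.2], [0, 0, 1], threshold=-3.0)
```

Here the second segment scores below the threshold, so it leaves cluster 0 and becomes the only member of the new cluster 2.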


user 2008-12-08