This thesis is split into seven main chapters on the topic of robust speaker diarization for meetings. A brief description of each chapter follows.
Chapter 2 examines the problem at hand: how to robustly and optimally determine ``Who spoke when?'' in the meetings domain, where multiple microphones are usually available for recording. It first reviews the features previously used in speaker-related problems and then analyzes the state of the art in speaker segmentation, which plays an important part in many speaker diarization algorithms. A review of previously proposed diarization algorithms and implementations then sets the ground for a description of the projects, databases and systems that, to date, have focused mainly on the meetings domain. Finally, given the multichannel nature of a meeting room, acoustic enhancement theory for processing multiple microphones is introduced, and the main techniques for obtaining a single ``enhanced'' channel from multiple inputs are reviewed.
Chapter 3 leads the reader through the system implementation, which is based on the diarization system that existed for broadcast news prior to this thesis work. An initial review of the ideas behind the broadcast news speaker diarization system and of its implementation is followed by an analysis of the differences between the two domains and of the changes needed to adapt the system to meetings. Finally, the meetings implementation is described in comparison to the prior system. Each of the blocks and algorithms that has been reused, refurbished or created from scratch for the meetings domain is introduced, while the detailed description of the novel algorithms presented in this thesis is left for later chapters.
Chapter 4 describes in detail all the novel techniques introduced in this thesis for the processing of single-channel acoustic data. These include a new speech/non-speech decoder that improves on the previous version by requiring no training and being better adapted to the diarization process. They also include several algorithms for describing and modeling speaker clusters: algorithms for determining the number of clusters and for selecting model complexity, a new training algorithm and two cluster purification algorithms.
Chapter 5 describes in detail the particular characteristics of using multiple channels in a meeting room and how the diarization process can benefit from them. A filter-and-sum beamforming algorithm is selected for the task. The basic algorithm and its implementation are described, covering both the well-known and the novel algorithms used in it. The chapter also describes the use of the Time Delay of Arrival (TDOA) between channels as a parallel feature stream for the diarization module. Finally, a novel algorithm for determining the relative weight of the TDOA and acoustic features is described.
Chapter 6 describes the experiments that show the appropriateness of all the proposed techniques. It first describes the experimental setup and then presents and explains the results of each experiment, comparing them to a baseline derived either from the original broadcast news system prior to this thesis work or from intermediate (well established) points.
Chapter 7 describes the content of and motivations behind the NIST Rich Transcription evaluations, which have been the tool used to assess the quality of the proposed diarization system and to compare it to other research systems, and which are the source of all datasets used in the experiments. ICSI's submissions for 2005 and 2006 are described in detail, and results for those evaluations are given.
Finally, Chapter 8 summarizes the major contributions and results obtained in this thesis and proposes some improvements and future work.