This PhD thesis addresses speaker diarization for meetings. To answer the question ``Who spoke when?'', the presented speaker diarization system processes a variable number of microphones spread around the meeting room and determines the optimum output without any prior knowledge of the number of speakers or their identities.
The presented system takes as its baseline the speaker diarization technology for broadcast news existing at the International Computer Science Institute (ICSI) and adapts it to the meetings domain by developing new algorithms and improving existing ones. While prior work on speaker diarization for meetings proposed some form of parallel diarization processing of the acoustics followed by a fusion of the multiple channel outputs, the proposed system uses acoustic beamforming to obtain an ``enhanced'' single channel together with information about the speaker positions, and combines both in a single-channel speaker diarization process.
The system then discards non-speech segments using a new hybrid speech/non-speech detector and processes both acoustics and speaker position information. The new algorithms include automatic algorithms for model complexity selection, initialization and training, for choosing the number of initial clusters and their initial segments, and for frame and segment purification, among others.
The development of the system was closely linked to participation in the NIST Rich Transcription (RT) speaker diarization evaluations for meetings in 2005 and 2006. In both submissions the systems proposed by ICSI, for both lecture and conference room data and with various numbers of microphones, obtained consistently good results.
Experiments were conducted on the NIST Rich Transcription evaluation datasets to analyze the suitability of each individual module, yielding results that can be easily compared with other systems and implementations. Comparing the system at the start of the thesis with the optimum system proposed, a 41.15% relative improvement is reported on the development set and a 25.45% relative improvement on the evaluation set.
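For reference, a relative improvement of this kind is conventionally computed from the error of the baseline and of the final system. The expression below is a sketch under that assumption; the symbols $E_{\mathrm{base}}$ and $E_{\mathrm{final}}$ are illustrative names introduced here for the error rates of the initial and the optimum system, and the abstract itself does not state the underlying absolute values.

\[
\text{relative improvement} = \frac{E_{\mathrm{base}} - E_{\mathrm{final}}}{E_{\mathrm{base}}} \times 100\%
\]

Under this reading, a 41.15% relative improvement would mean that the final error is $(1 - 0.4115) = 0.5885$ times the baseline error on the development set.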
user 2008-12-08