Definition of the Thesis Objectives

The main objective of this thesis is the development of a robust speaker diarization system for use in the meetings domain. To accomplish this, the following concrete objectives are established (in no particular order of importance):

The speaker diarization system is to be built on the expertise accumulated at ICSI through its research on broadcast news. First, the differences between broadcast news and meetings need to be analyzed. Then the mono-channel speaker diarization system used for broadcast news is to be adapted to the meetings domain, first by addressing the points where the two domains differ and then by refining the existing algorithms to improve performance.
The resulting system is to be as independent as possible of the room layout, the number and placement of microphones, and the kind of meeting. It should also be easy to adapt to new domains with as little development time as possible. Within the meetings domain, the algorithms should obtain all necessary parameters automatically for each meeting and should work with acceptable performance under all possible meeting conditions. When ported to a new domain, the system should perform well from the start.
The algorithms implemented for the meetings system should reduce ``show flakiness'' (Mirghafori and Wooters, 2006), that is, sudden changes in system performance within the same set upon slight modification of the parameter settings. They should also improve within-set robustness, yielding similar results when the same system is run on data other than the development data. This can be achieved by researching system parameters that depend on the particular characteristics of each individual audio excerpt rather than on the whole set, which makes them more robust to changes in the set used. These parameters need to show a flat performance response around their optimum, so that small changes do not dramatically affect the outcome.
In a similar fashion, the system is intended to be train-free (no external data is used to train acoustic models prior to the test). This allows both quick adaptation to new domains and robust performance when new data within the same domain has different acoustic content from the development data. This was already a goal of the broadcast news system, where only the speech/non-speech detector needed to be trained. The proposed system aims to replace this module with a train-free alternative and to implement all new algorithms and improvements so that model training does not depend on any data outside the test set. Development data will nonetheless be used to set the system parameters.
The system is developed for participation in the NIST Rich Transcription (RT) evaluations of 2005 and 2006, in order to benchmark the technology and algorithms implemented against other systems given the same data. All decisions taken and all parameter settings comply with the existing rules of these evaluations, which are intended to measure general system performance without emphasis on any particular application.
Last but not least, emphasis is put on publishing the results and the improvements made to the system, so that other research groups can follow the progress made at ICSI in speaker diarization. Furthermore, efforts are made to make the system available for use, either as a whole or as individual modules, both internally and by external users, with support given when possible.
