Definition of the Thesis Objectives
The main objective of this thesis is the development of a robust
speaker diarization system for use in the meetings domain.
To fully accomplish this, a set of concrete objectives is
established (in no particular order of importance):
- The speaker diarization system is to be built using the
expertise accumulated at ICSI through its research on broadcast
news. First, the differences between broadcast news and meetings
need to be analyzed. Then the mono-channel speaker diarization
system used in broadcast news is to be adapted to the meetings
domain by first addressing the points where both domains differ,
and then refining the current algorithms to improve performance.
- The resulting system is to be as independent as possible of
the room layout, the number and placement of microphones, and the
kind of meeting. It should also be easy to adapt to new
domains with as little development time as possible. Within the
meetings domain, the algorithms should be able to obtain all
necessary parameters automatically for each meeting and should
work under all possible meeting conditions with acceptable
performance. When ported to new domains, the system should
perform well from the start.
- The algorithms implemented for the meetings system should
reduce ``show flakiness'' (Mirghafori and Wooters, 2006), which
refers to sudden changes in system performance, within the
same set, upon slight modification of the parameter settings. They
should also improve within-sets robustness, yielding similar
results when the same system is run on data other than the
development data. This can be achieved by research into system
parameters that focus on the particular characteristics of the
individual audio excerpts instead of the whole set, thus becoming
more robust to changes in the data set used. These parameters need
to have a flat performance response around the optimum so that
small changes do not dramatically affect the outcome.
- In a similar fashion, the system is also intended to be
train-free (no external data is used to train acoustic models prior
to the test). This allows both quick adaptation to new domains and
robust performance when new data within the same domain has
different acoustic content than the development data. This was
already a goal of the broadcast news system, where only the
speech/non-speech detector needed to be trained. The proposed system
aims to replace this module with a train-free alternative and to
implement all new algorithms and improvements so that they do not
depend on any data outside the test set for model training.
Development data will, however, still be used to set the system
parameters.
- The system is developed for participation in the NIST Rich
Transcription (RT) evaluations of 2005 and 2006 in order to
benchmark the performance of the technology and algorithms
implemented against other systems given the same data.
All decisions taken and parameter settings are in line with the
existing rules of these evaluations, which are intended to measure
general system performance, without emphasis on any particular
- Last but not least, emphasis is placed on publishing the
results and improvements made to the system so that other
research groups can follow the progress made at ICSI in
speaker diarization. Furthermore, efforts are made to
make the system available, either in its entirety
or as individual modules, to both internal and external users,
with support provided when possible.