The RT06s evaluation continues the parallel testing of conference
room data and lecture room data. This year five laboratories
participated in the evaluation, contributing a range of new
systems and ideas. For a full description of the evaluation, refer
to Fiscus, Ajot, Michet and Garofolo (2006). An overview of the systems in RT06s
follows:
- The Athens Information Technology (AIT) system
(Rentzeperis et al., 2006) performs speaker segmentation followed
by clustering. The primary system uses the classic BIC
implementation (Shaobing Chen and Gopalakrishnan, 1998) for
speaker segmentation. A contrastive system uses a silence-based
method that cuts segments at silence points. The first step of
the clustering process also uses BIC, in this case to merge
adjacent segments believed to be from the same speaker. Finally,
all segments are modeled with GMMs and a likelihood-based
technique is used to cluster them.
- The LIMSI system (Zhu et al., 2006) adapts the
high-performance system presented for RT04f (Zhu et al., 2005) in
order to process lecture room data. It is based on a two-stage
process in which a BIC agglomerative clustering precedes a speaker
identification module, where cross likelihood (Reynolds et al., 1998)
is used to finish the clustering. In this system the speech
activity detection module is reworked to adapt it to the lecture
acoustics by using a likelihood ratio between pretrained speech
and silence models. The MDM condition is processed by randomly
selecting one of the channels in the set and running the system on
that channel alone.
- The LIA system (improvements of the E-HMM based speaker diarization system
for meetings records, 2006) presents a single
system based on the E-HMM top-down hierarchical clustering that has
been presented in previous evaluations. This submission includes
a few improvements to the system. First, the selection of new
speakers added to the system is modified to take into account all
currently selected speakers, which makes it more robust and allows
every speaker to fall in at least one cluster. Second, a segment
purification algorithm is proposed, following
Anguera, Wooters, Peskin and Aguilo (2005), to remove from the
existing clusters any segments belonging to other speakers.
Third, some feature normalization techniques are applied at
the frontend level. Finally, an algorithm to detect overlapping
speech is proposed, although it did not succeed in lowering the
final diarization error rate.
- The AMI team (Leeuwen and Huijbregts, 2006) was formed by TNO and
the University of Twente. They submitted three systems to the
evaluation. The first system is very similar to the one presented
by TNO in RT05s (van Leeuwen, 2005). The other two systems use
a hierarchical clustering that follows the work at ICSI presented
in Anguera, Wooters, Peskin and Aguilo (2005). One of the two systems improves
runtime by using a Viterbi-based cluster merging criterion.
Each cluster is taken out of the ergodic HMM model (one at a time)
and a Viterbi decoding gives the likelihood of the data as modeled
by the remaining clusters. The cluster whose removal causes the
least loss in likelihood is eliminated and merged with the rest.
The system iterates while the overall likelihood increases.
- The ICSI system (Anguera, Wooters and Pardo, 2006b) is based on the system
for RT05s (Anguera, Wooters, Peskin and Aguilo, 2005) and includes many new ideas which
will be covered in the rest of this thesis. The main step forward
is the total independence from training data achieved by the
creation of a new hybrid speech/non-speech detector
(Anguera, Aguilo, Wooters, Nadeu and Hernando, 2006) and the inclusion of delays as an independent
feature stream.
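
As an illustration of the BIC criterion that the AIT system uses for segmentation and for merging adjacent segments, the following is a minimal sketch of the delta-BIC test in the style of Chen and Gopalakrishnan. It is not the AIT implementation: it assumes one-dimensional features and single Gaussian models to stay short, and the names `delta_bic`, `best_change_point`, `margin` and `lam` are hypothetical.

```python
import math

def _log_var(xs, floor=1e-8):
    """Log of the maximum-likelihood variance of a 1-D sample (floored)."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.log(max(var, floor))

def delta_bic(xs, split, lam=1.0):
    """Delta-BIC for splitting the window xs at index `split`.

    Assumption: 1-D features, one Gaussian per hypothesis.
    Positive values favour two models (a speaker change),
    negative values favour a single model.
    """
    n, n1, n2 = len(xs), split, len(xs) - split
    # Likelihood gain of modeling each side separately.
    gain = 0.5 * (n * _log_var(xs)
                  - n1 * _log_var(xs[:split])
                  - n2 * _log_var(xs[split:]))
    # The two-model hypothesis has 2 extra parameters (mean, variance).
    penalty = 0.5 * lam * 2 * math.log(n)
    return gain - penalty

def best_change_point(xs, margin=10, lam=1.0):
    """Return (index, score) of the best positive split, else None."""
    scored = [(delta_bic(xs, i, lam), i)
              for i in range(margin, len(xs) - margin)]
    score, idx = max(scored)
    return (idx, score) if score > 0 else None
```

The same quantity can serve as a merging test: two adjacent segments are candidates for merging when the delta-BIC at their boundary is negative. The penalty weight `lam` is the usual tuning parameter of this criterion.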
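
The iterative cluster elimination described for the AMI system can be sketched as a greedy leave-one-out loop. This is only an outline under stated assumptions: the real system scores each candidate with a Viterbi decoding over the ergodic HMM, whereas here the scoring function is passed in as a callable, and `toy_score` is a hypothetical stand-in (penalized squared distance to the nearest cluster centre) used only to make the example runnable.

```python
def greedy_eliminate(clusters, score):
    """Remove clusters one at a time while the overall score increases.

    `score(clusters)` is assumed to return the likelihood of the data
    under the given set of clusters (a Viterbi likelihood in the
    system described above; any callable works for this sketch).
    """
    clusters = list(clusters)
    current = score(clusters)
    while len(clusters) > 1:
        # Score the model with each cluster left out in turn.
        trials = [(score(clusters[:i] + clusters[i + 1:]), i)
                  for i in range(len(clusters))]
        best_score, best_i = max(trials)
        if best_score <= current:
            break  # no removal improves the overall score
        del clusters[best_i]
        current = best_score
    return clusters, current

def toy_score(data, penalty):
    """Hypothetical stand-in for the Viterbi likelihood (1-D centres)."""
    def score(centres):
        fit = -sum(min((x - c) ** 2 for c in centres) for x in data)
        return fit - penalty * len(centres)
    return score

data = [0.0, 0.1, -0.1, 5.0, 5.1, 4.9]
kept, final = greedy_eliminate([0.0, 0.05, 5.0, 5.05],
                               toy_score(data, 0.5))
```

In the toy score, the data of a removed cluster is implicitly reassigned to the nearest remaining centre, which plays the role of merging the eliminated cluster with the rest.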
2008-12-08