The RT05s evaluation introduced a new kind of meeting data:
meetings in a lecture environment, where a speaker gives a
lecture in front of an audience, with occasional
question-and-answer periods. In this evaluation, systems could
be submitted for either or both subtasks (lecture
room and conference room data). The set of microphone types was
extended from the previous evaluations with two new kinds present
in the lecture room data (entirely recorded by the
partners in the CHIL project). These were labelled MM3A
(Multiple Mark III microphone arrays), consisting of one or
several 64-element microphone arrays developed by NIST and
positioned on one of the walls of the meeting room, and MSLA
(Multiple source localization microphones), consisting of four
sets of four microphones each, used primarily for speaker
localization but also available for speaker diarization. For a
more thorough description of the tasks and microphone types,
please refer to Fiscus et al. (2005). The following is a brief
description of the approaches taken in this evaluation:
- The Macquarie University system (Cassidy, 2004)
participated only in the SDM condition, building on its RT04s
system. In the RT05s submission it uses the KL distance between
clusters and post-processes the segments using speaker
identification techniques to refine the segment-to-speaker
assignment.
- The TNO speaker diarization system (van Leeuwen, 2005)
addresses the MDM condition using a single channel. It first uses
a Speech Activity Detector (SAD) to filter out non-speech frames,
then performs segmentation and agglomerative clustering via
BIC.
- The ICSI-SRI speaker diarization system (Anguera, Wooters, Peskin and Aguilo, 2005)
uses a filter-and-sum module to obtain an enhanced signal in the
MDM condition, and then applies an iterative agglomerative
clustering using a BIC-like metric. This system and its
improvements for RT06s are described in this thesis.
- The ELISA consortium system (Istrate et al., 2005) differs
from their RT04s system in that a preprocessing step is
performed on the MDM channels to obtain a single enhanced channel.
This channel is a weighted sum of the individual channels, weighted
by their relative Signal-to-Noise Ratio (SNR), without estimating
any relative delays. Three different clustering systems are
then proposed. The first system is based on EHMM
(Meignier et al., 2001) and performs a top-down clustering. The
second and third systems are both bottom-up, one using speaker
change detection via GLR and agglomerative clustering via BIC, and
the other using BIC for change detection and UBM-BIC in the
agglomerative clustering part. All systems use a resegmentation
stage at the end in order to refine the speaker segments. For this
evaluation each system was run individually, without combining
the different outputs.
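
Several of the systems above rely on BIC to decide whether two
clusters should be merged. As a rough illustration only, and not
the implementation of any of the cited systems, the following
Python sketch computes a delta-BIC score between two clusters, each
modelled by a single diagonal-covariance Gaussian. The function
names, the diagonal-covariance simplification, the penalty weight
and the sign convention (a negative score favours merging) are all
assumptions made for this example.

```python
import math
import random


def diag_variances(frames):
    """Per-dimension maximum-likelihood variance of a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dim)]
    # A small floor keeps the logarithm finite for near-constant dimensions.
    return [
        max(sum((f[k] - means[k]) ** 2 for f in frames) / n, 1e-8)
        for k in range(dim)
    ]


def log_det_diag(frames):
    """Log-determinant of the diagonal covariance estimated from the frames."""
    return sum(math.log(v) for v in diag_variances(frames))


def delta_bic(x, y, penalty_weight=1.0):
    """Delta-BIC between modelling x and y separately versus jointly.

    Assumed convention: a positive value favours two separate models
    (different speakers); a negative value favours merging the clusters.
    """
    n_x, n_y = len(x), len(y)
    n = n_x + n_y
    dim = len(x[0])
    # Likelihood gain obtained by using two Gaussians instead of one.
    gain = 0.5 * (
        n * log_det_diag(x + y)
        - n_x * log_det_diag(x)
        - n_y * log_det_diag(y)
    )
    # Two models need one extra mean vector and one extra variance vector.
    extra_params = 2 * dim
    penalty = 0.5 * penalty_weight * extra_params * math.log(n)
    return gain - penalty


# Hypothetical toy data: two clusters drawn from the same distribution
# and one cluster drawn from a clearly shifted distribution.
rng = random.Random(0)
same_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
same_b = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(300)]
other = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(300)]

print("same source:     ", delta_bic(same_a, same_b))
print("different source:", delta_bic(same_a, other))
```

In an agglomerative scheme, a score of this kind would be computed
for every pair of clusters, the pair with the lowest score merged,
and the process stopped once no pair scores below zero. Real systems
typically use richer models (full covariances or GMMs) and a tuned
penalty weight.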