Within the NIST 2004 Spring Rich Transcription Evaluation
(NIST Spring Rich Transcription Evaluation in Meetings
website,
http://www.nist.gov/speech/tests/rt/rt2005/spring, 2006) speaker diarization was evaluated in
meeting recordings in two different conditions: Multiple Distant
Microphones (MDM) and Single Distant Microphone (SDM). The MDM
condition uses multiple microphones located in the center of a
meeting table, and the SDM condition uses only one of these
microphones, normally the most centrally located one. This was the
first time that this task had been performed in the meetings
environment under the MDM condition. A full description of the
different tasks evaluated and the results of that evaluation can be
found in Garofolo et al. (2004). Following are the approaches (in
brief) that the participants proposed for the MDM and SDM
conditions:
- Macquarie University in Cassidy (2004) proposes
the same system for SDM as for MDM, always using the SDM
channel. A BIC-based speaker segmentation step is followed by
agglomerative clustering, using the Mahalanobis distance between
clusters and BIC as the stopping criterion.
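The Mahalanobis-plus-BIC agglomeration described above can be sketched as follows. This is an illustrative sketch only: the feature dimensionality, the pooled-covariance form of the distance, and the BIC penalty weight lambda are assumptions, not details of the Macquarie system.

```python
import numpy as np

def mahalanobis(c1, c2):
    # Squared Mahalanobis distance between cluster means, using the
    # pooled covariance of both clusters (rows = frames, cols = features).
    pooled = np.cov(np.vstack([c1, c2]).T)
    diff = c1.mean(axis=0) - c2.mean(axis=0)
    return float(diff @ np.linalg.solve(pooled, diff))

def delta_bic(c1, c2, lam=1.0):
    # Delta-BIC for merging two clusters modeled as full-covariance
    # Gaussians; a negative value means one joint model fits better,
    # so the pair should be merged.
    n1, n2 = len(c1), len(c2)
    d = c1.shape[1]
    both = np.vstack([c1, c2])
    logdet = lambda x: np.linalg.slogdet(np.cov(x.T))[1]
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n1 + n2)
    return 0.5 * ((n1 + n2) * logdet(both)
                  - n1 * logdet(c1) - n2 * logdet(c2)) - penalty

def agglomerate(clusters, lam=1.0):
    # Repeatedly merge the closest pair (Mahalanobis distance);
    # stop when BIC no longer favours merging that pair.
    clusters = list(clusters)
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: mahalanobis(clusters[p[0]],
                                                    clusters[p[1]]))
        if delta_bic(clusters[i], clusters[j], lam) >= 0:
            break  # BIC stopping criterion
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters
```

On synthetic data, two clusters drawn from the same Gaussian are merged while a well-separated third cluster is left alone.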
- The ELISA consortium in Fredouille et al. (2004)
proposes a two-axis merging strategy. A horizontal merging
consists of collapsing and resegmenting the clustering output of
their two expert systems (based on BIC and EHMM), as proposed in
the RT03 and SRE02 evaluations (Moraru, Meignier, Fredouille,
Besacier and Bonastre, 2004). This is done for each individual MDM
channel or for the SDM channel. The vertical merging is applied
when processing multiple channels and unifies all the individual
channel outputs into a single result by merging them at the output
level. It uses an iterative process that searches for the longest
speaker interventions common to all outputs and finally assigns
those short segments on which the different channels disagree to
the closest speaker.
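A much-simplified sketch of the vertical-merging idea follows. It assumes per-frame speaker labels from each channel that already share one speaker inventory (which the real system must establish itself), keeps long runs on which all channels agree as anchors, and assigns the remaining frames to the nearest anchor in time, a temporal stand-in for the acoustically "closest speaker".

```python
def vertical_merge(channel_labels, min_run=20):
    """Fuse per-frame speaker labels from several channel outputs.

    channel_labels: equally long label sequences, one per channel,
    assumed (for this sketch) to use a common speaker inventory.
    """
    n = len(channel_labels[0])
    # Per-frame label where all channels agree, else None.
    agree = [labels[0] if len(set(labels)) == 1 else None
             for labels in zip(*channel_labels)]
    # Keep only agreement runs of at least min_run frames as anchors.
    anchors = [None] * n
    i = 0
    while i < n:
        if agree[i] is not None:
            j = i
            while j < n and agree[j] == agree[i]:
                j += 1
            if j - i >= min_run:
                anchors[i:j] = [agree[i]] * (j - i)
            i = j
        else:
            i += 1
    # Fill unanchored frames from the nearest anchor in time,
    # via a forward and a backward pass.
    out, dist = [None] * n, [float('inf')] * n
    d, lab = float('inf'), None
    for i in range(n):
        if anchors[i] is not None:
            d, lab = 0, anchors[i]
        else:
            d += 1
        out[i], dist[i] = lab, d
    d, lab = float('inf'), None
    for i in range(n - 1, -1, -1):
        if anchors[i] is not None:
            d, lab = 0, anchors[i]
        else:
            d += 1
        if d < dist[i]:
            out[i] = lab
    return out
```

With three channels that disagree only around a speaker change, the fused output keeps the agreed regions intact and resolves the disputed frames from the surrounding anchors.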
- Carnegie Mellon University (CMU) in Jin et al. (2004)
presents a clustering scheme based on the GLR distance with a BIC
stopping criterion. To obtain the initial segmentation of the
data, it follows a three-step process: first, Speech Activity
Detection (SAD) is performed over all the channels; then the
resulting segments from all channels are collapsed into a single
segmentation, and the best channel (according to an energy/SNR
metric) is chosen for each segment. Finally, GLR change detection
is applied on segments longer than 5 s to detect any missed change
points. The speaker clustering uses a global GMM trained on all
the meeting excerpt data and adapted to each segment, and uses the
GLR to compute the cluster-pair distances in an agglomerative
clustering process with a BIC stopping criterion.
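The per-segment channel selection step above can be sketched as follows. The SNR estimate used here (mean frame energy over a 10th-percentile noise floor, in dB) is an assumed stand-in for the unspecified energy/SNR metric, and the 160-sample frames and sample-indexed segments are illustrative.

```python
import numpy as np

def best_channel_per_segment(channels, segments):
    """For each collapsed SAD segment, pick the channel with the
    highest energy-based SNR estimate.

    channels: list of 1-D sample arrays, one per distant microphone.
    segments: list of (start_sample, end_sample) tuples.
    """
    def snr_estimate(x):
        # Crude SNR: mean frame energy relative to the 10th-percentile
        # ("noise floor") frame energy, expressed in dB.
        frames = x[: len(x) // 160 * 160].reshape(-1, 160)
        e = (frames ** 2).mean(axis=1) + 1e-12
        return 10 * np.log10(e.mean() / np.percentile(e, 10))

    picks = []
    for start, end in segments:
        scores = [snr_estimate(ch[start:end]) for ch in channels]
        picks.append(int(np.argmax(scores)))
    return picks
```

On a synthetic example with one clean channel (quiet noise floor, then a tone) and one channel dominated by microphone noise, the clean channel is selected.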