In the lecture room environment the submission consisted of
primary systems for the MDM, SDM and MSLA tasks, and contrastive
systems for MDM (two systems), SDM and MSLA (two systems).
The following is a brief description of each of these systems and
its motivation:
- MDM, SDM and MSLA primary condition
(MDM/SDM/MSLA_p-omnione): It was observed in the development data
that on many occasions it was possible to obtain the best
performance by just guessing one speaker for the whole duration of
the lecture. This is particularly true when the meeting excerpt
consists only of the lecturer speaking, but it often also holds
in the question-and-answer sections, since many of the excerpts in
the development data consisted of very short questions followed by
long answers by the lecturer. These single-speaker systems were therefore submitted as the
primary submissions, serving also as a baseline score for the
lecture room environment. Contrary to what was observed in the
development data, the contrastive (``real'') systems outperformed
the primary (``guess one speaker'') submissions on the evaluation
data. Depending on the data to be processed (the length of the
lecturer turn and the amount of silence in the recordings), it
might not be feasible to improve upon a ``dummy'' system with
current state-of-the-art diarization systems.
- MDM using speech/non-speech detection (mdm_c-spnspone):
This differs from the primary submission only in the use of the
speech/non-speech (spnsp) detector to eliminate the areas of
non-speech. On the development data it was observed that
non-speech regions were only labelled (in the hand-made
references) when there was a change of speakers, which never
happened for the ``all lecturing'' sections. In a real system
though it is important to detect these silences and not attribute
them to speakers. This submission is meant to complement the
previous one by trying to improve performance where between-speech
silences are marked.
- MDM using only the TableTop microphone (mdm_c-ttoppur):
Of the five microphones available in the lecture room, one
microphone (labelled as the ``TableTop'' microphone) is clearly of
much better quality than all the others (as an SNR comparison
among the channels shows). It is located in a different
part of the room and is of a different kind, which could be the
reason for its better performance. In the evaluation data this
channel was identified using an SNR estimator, and the standard
diarization system was run on it. No spnsp detection was used in this system.
- SDM using the SDM channel with a minimum duration of 12
seconds for each cluster (sdm_c-pur12s): This uses the clustering
system on the SDM channel. It did not use the spnsp detector
either. It was observed that a minimum duration of 12
seconds bypassed the issue of silences marked as speech in the
reference files and forced the system to end with fewer
clusters.
- MSLA with standard filter&sum (msla_c-nwsdpur12s): In
order to combine the various available speaker-localization
arrays, we used the filter&sum processing, with a randomly chosen
channel from one of the arrays as the reference channel. The enhanced
channel obtained was then clustered using the 12-second minimum
duration system.
- MSLA with weighted filter&sum (msla_c-wsdpur12s): In the
time between the conference room and lecture room submissions,
experiments were performed with a first version of the weighted
filter&sum algorithm as presented in this thesis. It was applied
to the MSLA channels in this system.
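
To make the behaviour of the ``guess one speaker'' primary systems more concrete, the following is a minimal sketch of how such an output could be scored against a hand-made reference. It is not the scoring tool used in the evaluation: the error measure is a simplified diarization error (no overlapping speech, no scoring collar), and the function name and the (start, end, speaker) segment format are assumptions made only for this illustration.

```python
from collections import defaultdict


def single_speaker_der(reference, duration):
    """Simplified diarization error of a one-speaker guess.

    The hypothesis labels the whole interval [0, duration] as a single
    speaker. `reference` is a list of non-overlapping (start, end, speaker)
    tuples in seconds (an assumed format, for illustration only).
    """
    speech_per_speaker = defaultdict(float)
    for start, end, speaker in reference:
        speech_per_speaker[speaker] += end - start

    total_speech = sum(speech_per_speaker.values())
    if total_speech <= 0:
        raise ValueError("reference contains no speech")

    # The single hypothesis speaker is mapped to the dominant reference
    # speaker; all other reference speech counts as speaker error.
    speaker_error = total_speech - max(speech_per_speaker.values())

    # Everything outside the reference speech segments is non-speech that
    # the one-speaker guess wrongly labels as speech (false alarm).
    false_alarm = max(0.0, duration - total_speech)

    # The guess covers the whole excerpt, so there is no missed speech.
    return (false_alarm + speaker_error) / total_speech


# A short question followed by a long answer from the lecturer.
reference = [
    (0, 100, "lecturer"),
    (100, 110, "audience"),
    (110, 300, "lecturer"),
]
print(round(100 * single_speaker_der(reference, duration=300), 2), "% error")
```

In this toy example only the 10 seconds of audience speech are misattributed, which illustrates why a one-speaker guess is hard to beat on excerpts dominated by the lecturer. When between-speech silences are marked in the reference, the false alarm term grows, which is the case addressed by the speech/non-speech detector.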
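
The selection of the best-quality channel by SNR, as done for the TableTop system, can be sketched as follows. The SNR estimator actually used is not specified in this section, so the one below is a stand-in assumption: it compares the mean energy of the loudest frames with that of the quietest frames of each channel. The function names, frame length and frame fraction are all hypothetical.

```python
import math


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Crude energy-based SNR estimate in dB (illustrative assumption).

    The quietest `noise_fraction` of the frames is taken as noise and
    the loudest `noise_fraction` as signal.
    """
    energies = sorted(
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    if not energies:
        raise ValueError("signal is shorter than one frame")

    k = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:k]) / k
    signal = sum(energies[-k:]) / k

    eps = 1e-12  # avoids log(0) and division by zero on digital silence
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def pick_best_channel(channels, frame_len=400):
    """Return the name of the channel with the highest estimated SNR.

    `channels` maps a channel name to its sequence of samples.
    """
    return max(
        channels,
        key=lambda name: estimate_snr_db(channels[name], frame_len=frame_len),
    )


# Toy signals: a quiet stretch followed by a loud one, per channel.
channels = {
    "TableTop": [0.001] * 40 + [1.0] * 40,
    "Mic2": [0.1] * 40 + [1.0] * 40,
}
print(pick_best_channel(channels, frame_len=4))
```

The selected channel would then be passed to the standard diarization system. A more careful estimator would rely on an explicit speech/non-speech segmentation rather than on energy percentiles.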