Reference Segmentation Selection and Calculation

Predefined reference segmentations are needed to compute the DER of the system hypotheses. All the data used in this chapter comes from the NIST evaluations, which define a set of rules on how the transcription should be made. In the latest evaluation (NIST Fall Rich Transcription on meetings 2006 Evaluation Plan, 2006) they were:

Within the NIST evaluation campaigns, all data sent out for development and test was carefully transcribed by hand by the Linguistic Data Consortium (LDC). Such transcription was usually done by listening, for each participant, to the channel with the best available quality (usually the Individual Head Microphone, IHM, when available). The per-participant transcriptions were then collapsed into a single main reference file for all participants.

Prior to the RT06s evaluation, NIST and some of the participants (including ICSI) were considering the use of forced alignments of the acoustic data. Although hand alignments were still used in RT06s, NIST intends to change the reference transcriptions to forced alignments in the near future. The need for such a change became pressing when overlap regions started being scored as part of the main metric for system performance. In chapter 3.2 a quantitative comparison is made between forced and hand alignments. In brief, the main drawbacks found in the hand-aligned references are:

It was decided at ICSI that development for the RT06s evaluation had to be done using forced alignments in order to avoid these problems. To obtain the forced alignment of a meeting recording, a two-step process was followed:

  1. The human word transcription of each IHM channel was force-aligned to the audio of that channel, obtaining a time-aligned word transcription for each speaker wearing a headset. To do so, the ICSI-SRI ASR system (Janin et al., 2006) was used. Experiments pursued by NIST after the RT06s evaluation (Fiscus, Garofolo, Ajot and Michet, 2006) indicated that very similar behavior for all participants could be obtained using either the ICSI-SRI or the LIMSI ASR system for the alignment.

  2. The transcriptions from the individual speakers were collapsed into a single file, and the transcription rules were applied to determine whether two consecutive words should be joined into a single speaker segment or placed in two separate speaker segments.
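The second step can be illustrated with a minimal sketch. It assumes that the only rule applied is a maximum pause between consecutive words of the same speaker; the 0.3 s default, the `Word` and `Segment` types and the `collapse_words` function are hypothetical names and values chosen for illustration, not the actual tools or rules used to build the references.

```python
from dataclasses import dataclass


@dataclass
class Word:
    speaker: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    speaker: str
    start: float
    end: float


def collapse_words(words, max_gap=0.3):
    """Merge time-aligned words from all speakers into speaker segments.

    Two consecutive words of the same speaker are joined when the pause
    between them is at most `max_gap` seconds (an assumed rule); otherwise
    a new segment is started. Segments of different speakers may overlap.
    """
    segments = []
    open_seg = {}  # speaker -> segment currently being extended
    for w in sorted(words, key=lambda w: (w.start, w.end)):
        seg = open_seg.get(w.speaker)
        if seg is not None and w.start - seg.end <= max_gap:
            # Short pause: extend the current segment of this speaker.
            seg.end = max(seg.end, w.end)
        else:
            # Long pause or first word of the speaker: open a new segment.
            seg = Segment(w.speaker, w.start, w.end)
            open_seg[w.speaker] = seg
            segments.append(seg)
    return sorted(segments, key=lambda s: (s.start, s.speaker))


if __name__ == "__main__":
    words = [
        Word("A", 0.0, 0.4),
        Word("A", 0.5, 0.9),
        Word("B", 0.8, 1.2),
        Word("A", 2.0, 2.5),
    ]
    for seg in collapse_words(words):
        print(seg.speaker, seg.start, seg.end)
```

Keeping one open segment per speaker lets overlapping speech from different headsets produce overlapping segments in the collapsed file, which matters once overlap regions are scored.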

Using forced alignments also has several drawbacks:

All results reported in this thesis were computed using the forced alignments obtained using the ICSI-SRI ASR system, unless otherwise stated.
