The use of predefined reference segmentations is necessary to
compute the DER given the system hypotheses. The data used in this
chapter all come from the NIST evaluations, which defined a set of
rules on how the transcriptions should be made. In the latest
evaluation (NIST Rich Transcription Spring 2006 Meeting Recognition
Evaluation Plan, 2006) they were:
- Within a speaker's turn, pauses lasting less than 0.3
seconds are considered to belong to that speaker turn. Pauses of
more than 0.3 seconds, or pauses between different speaker turns,
are considered non-speech. This value was determined in 2003 by
NIST as the minimum duration of a pause indicating an utterance
boundary.
- Vocal noises, such as laughs, coughs, sneezes, breaths and
lip-smacks, are considered non-speech, and this is taken into
account when setting segment boundaries.
- Although not a rule in creating the transcriptions, it is
worth recalling the collar of 0.25 seconds considered around each
reference segment boundary when comparing it to the hypothesis, in
order to account for inaccuracies in determining the true segment
boundary (a sketch of its effect is given below).
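As an illustration of how the collar affects scoring, the following
minimal sketch computes the time regions that are excluded from
evaluation around each reference boundary, assuming the 0.25-second
collar used in the NIST evaluations. It is a simplification, not the
official NIST md-eval scoring tool; all names are hypothetical.

```python
# Hypothetical sketch of the scoring collar; NOT the official
# NIST md-eval scoring tool.

def no_score_regions(reference_segments, collar=0.25):
    """Return time intervals excluded from scoring: a +/- `collar`
    second window around every reference segment boundary."""
    windows = []
    for start, end in reference_segments:
        windows.append((start - collar, start + collar))
        windows.append((end - collar, end + collar))
    windows.sort()
    # Merge overlapping windows so each excluded span appears once.
    merged = []
    for lo, hi in windows:
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

# A reference segment from 1.0 s to 3.0 s yields two excluded windows:
print(no_score_regions([(1.0, 3.0)]))  # [(0.75, 1.25), (2.75, 3.25)]
```

Errors committed by a system inside these windows do not count
towards the DER, which forgives small differences in boundary
placement between reference and hypothesis.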
Within the NIST evaluation campaigns, all data sent out for
development and test was carefully transcribed by hand by the
Linguistic Data Consortium (LDC). Such transcription was usually
done by listening, for each participant, to the channel with the
best possible quality (usually the Individual Head Microphone, IHM,
channel when available), and the transcriptions were then collapsed
into a main reference file covering all participants.
Prior to the RT06s evaluation, NIST and some of the participants
(including ICSI) were considering the use of forced alignments of
the acoustic data. Although hand alignments were still used in
RT06s, it is NIST's intention to change the reference
transcriptions to forced alignments in the near future. The need
for such a change became pressing when overlap regions started
being scored as part of the main metric for system performance. In
section 3.2 a quantitative comparison between forced and hand
alignments is given. In brief, the main drawbacks found in the
hand-aligned references are:
- Transcription inconsistency over time, due to the one-year
gap between the transcriptions for each evaluation, which brings
changes in transcription criteria, human transcribers,
transcription tools, etc. This leads to systematic differences
between the reference files from which the systems try to learn.
- The inability, at times, to detect short speaker pauses when
these are around 0.3 seconds. This causes problems for systems
trained on this data, which are then unable to learn when a
speaker pause should become a silence and when it should not.
- Extended durations when labelling overlap speech. As seen in
section 3.2, the average length of speech in overlap is larger in
the hand alignments, usually because the human transcribers added
some arbitrary padding to either side of some overlap regions,
leading to greater overlap errors. This difference varies from
evaluation to evaluation and was detected only in RT06s data, once
overlap became part of the main metric.
- The inability, at times, to identify in the distant
microphones (the ones actually used in the evaluations) some
sounds or artifacts that are heard and transcribed in the IHM
channels (which are much closer to each speaker's mouth).
It was decided at ICSI that development for the RT06s evaluation
had to be done using forced alignments in order to avoid these
problems. To obtain the forced alignment of a meeting recording, a
two-step process was followed:
- The human word transcription for each of the IHM channels
was used to force-align the audio of that channel to its
transcription, obtaining a time-aligned word transcription for
each speaker wearing a headset. To do so, the ICSI-SRI ASR system
(Janin et al., 2006) was used. Experiments carried out by NIST
after the RT06s evaluation (Fiscus, Garofolo, Ajot and Michel,
2006) indicated that very similar behavior for all participants
could be obtained using either the ICSI-SRI transcriptions or
LIMSI's ASR system transcriptions.
- The transcriptions from each individual speaker were
collapsed into a single file, and the transcription rules above
were applied to determine when two words were to be joined into a
single speaker segment and when two separate speaker segments
needed to be created (see the sketch after this list).
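A minimal sketch of this second step is shown below, assuming each
channel's forced alignment is available as (speaker, start, end)
word tuples. The 0.3-second threshold follows the transcription
rules quoted earlier; the function and variable names are
hypothetical.

```python
MAX_PAUSE = 0.3  # seconds; shorter pauses stay within a speaker turn

def words_to_segments(aligned_words, max_pause=MAX_PAUSE):
    """Collapse time-aligned words from all IHM channels into speaker
    segments. `aligned_words` holds (speaker, start, end) tuples."""
    # Group each speaker's words in time order.
    per_speaker = {}
    for speaker, start, end in sorted(aligned_words, key=lambda w: w[1]):
        per_speaker.setdefault(speaker, []).append((start, end))

    segments = []  # (speaker, segment_start, segment_end)
    for speaker, words in per_speaker.items():
        seg_start, seg_end = words[0]
        for start, end in words[1:]:
            if start - seg_end < max_pause:
                seg_end = max(seg_end, end)  # pause < 0.3 s: same turn
            else:
                # Pause of 0.3 s or more: close the segment, open a new one.
                segments.append((speaker, seg_start, seg_end))
                seg_start, seg_end = start, end
        segments.append((speaker, seg_start, seg_end))
    return sorted(segments, key=lambda s: s[1])
```

Note that segments from different speakers are built independently
and may overlap in time, which is what allows the resulting
reference to score overlap speech.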
Using forced alignments also has several drawbacks worth pointing
out:
- In the way these were produced, an IHM channel needed to be
available for each participant in the meeting in order to obtain
that channel's alignment. One meeting in RT05s (named
NIST_20050412-1303) included a participant speaking through a
telephone, who was not considered, therefore producing a
transcript lacking some of the data. This could be avoided by
using other channels instead, always trying to select the source
of optimum quality.
- Errors in the word transcriptions (which are produced by
human transcribers) propagate into the forced alignments. These
errors were measured to be much smaller than those made when
transcribing the speaker turns directly.
- Each ASR system makes its own systematic errors and
decisions, which translate into systematic segmentation issues.
These are thought to account for the differences between the
forced-alignment outputs of the various ASR systems that could be
used. Although such differences are very small, transcripts of
better quality, with reduced variability, could be derived from
the output of multiple systems (one illustrative way is sketched
below).
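As a purely hypothetical illustration of such a combination, the
sketch below pairs up the boundary times produced by several ASR
alignments for the same events and keeps the per-boundary median;
this is not a procedure used in this thesis.

```python
from statistics import median

def combine_boundaries(per_system_boundaries):
    """Given the same sequence of boundary times as produced by several
    ASR forced alignments (one list per system, assumed to pair up
    one-to-one), return the per-boundary median time."""
    return [median(times) for times in zip(*per_system_boundaries)]

# Three systems place the same two boundaries at slightly different
# times; the median smooths out each system's systematic bias.
print(combine_boundaries([[1.02, 5.10], [0.98, 5.15], [1.05, 5.08]]))
# -> [1.02, 5.1]
```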
All results reported in this thesis were computed using the forced
alignments obtained with the ICSI-SRI ASR system, unless otherwise
stated.