In addition to the acoustic signal, the reference transcriptions were also analyzed. The first parameter, computed both on the meetings and on the broadcast news data, is the speaking time per speaker in each show. This is important because the models need to be trained optimally on the available data, and therefore have to be adjusted if the amount of data per speaker changes across domains. Tables 3.8, 3.9 and 3.10 show the number of speakers, the average time per speaker, and the maximum and minimum speaking times.
[Tables 3.8, 3.9 and 3.10: number of speakers, average, maximum and minimum speaking time per speaker for each show.]
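As an illustration of how these statistics can be obtained, the sketch below sums the speaking time per speaker from a reference segmentation and derives the per-show figures reported in the tables. It assumes the references are available in NIST RTTM format (SPEAKER lines with start time and duration in the fourth and fifth fields and the speaker name in the eighth); the file name show.rttm is illustrative only.

```python
from collections import defaultdict

def speaker_times(rttm_path):
    """Sum the total speaking time (in seconds) per speaker
    from the SPEAKER lines of an RTTM reference file."""
    totals = defaultdict(float)
    with open(rttm_path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            duration = float(fields[4])   # segment duration in seconds
            speaker = fields[7]           # speaker label
            totals[speaker] += duration
    return totals

def show_statistics(rttm_path):
    """Per-show statistics of the kind shown in Tables 3.8-3.10:
    number of speakers, average, maximum and minimum speaking time."""
    times = list(speaker_times(rttm_path).values())
    return {
        "num_speakers": len(times),
        "average": sum(times) / len(times),
        "maximum": max(times),
        "minimum": min(times),
    }

if __name__ == "__main__":
    print(show_statistics("show.rttm"))  # illustrative file name
```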
In all cases the average values vary greatly, even within the same domain. For example, the CSPAN show in broadcast news has an average speaking time per speaker several orders of magnitude higher than any of the other shows. This is due to the comparable total show lengths imposed by NIST for the evaluations combined with the variability in the number of speakers present in each recording. From these results it is clear that an automatic way of selecting the speaker model complexity is necessary in order to model each of these cases correctly: the more data available from a speaker, the more complex its model needs to be in order to represent that speaker with the same level of detail as the others.
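One common way to implement such automatic complexity selection, sketched below, is to make the number of Gaussians in a speaker model proportional to the number of acoustic frames assigned to that speaker. The proportionality constant (here called frames_per_gaussian) and the bounds are illustrative values only, not the settings used in this thesis.

```python
def select_num_gaussians(num_frames,
                         frames_per_gaussian=750,  # illustrative ratio
                         min_gaussians=1,
                         max_gaussians=64):
    """Choose the GMM complexity for a speaker cluster in proportion
    to the amount of data assigned to it, so that speakers with little
    data get small, trainable models and dominant speakers get more
    detailed ones."""
    n = int(round(num_frames / frames_per_gaussian))
    return max(min_gaussians, min(max_gaussians, n))

# Example: a speaker with 30 s of speech at 100 frames/s
# gets a 4-Gaussian model with these illustrative settings.
print(select_num_gaussians(30 * 100))  # -> 4
```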
Another observation concerns the minimum and maximum speaking-time columns. The maximum speaking time indicates how long the main speaker in the recording speaks. In both the lecture room and broadcast news recordings this column tends to contain values much higher than the average speaking time. In lectures this happens when the excerpt mainly contains the lecturer giving the talk (sometimes filling the entire excerpt, sometimes with a short question-and-answer section). In the broadcast news shows it is usual when the show has an anchor speaker who directs the flow of the program. In the conference room meetings, the NIST shows also tend to have a dominant speaker.
The minimum speaking time column indicates how long the speaker with the fewest interventions speaks. In many of the lecture room meetings this value does not exist, as the lecturer speaks for the whole time. In the other cases, many of the lecture room and broadcast news recordings, as well as the NIST conference room recordings, contain very short durations. These speakers are difficult to model, as not much data is available, and they can introduce many problems and errors when their models are compared with those of the longer-speaking ones. This is why it is sometimes preferable to describe agglomerative clustering systems (like the one presented in this thesis) as aiming at the optimum number of final clusters rather than at the exact number of existing speakers. Although detecting these short speakers and labelling them as independent clusters is always desirable, doing so can often lead to other errors and should therefore be considered a secondary priority.
In Table 3.8 two different transcriptions were used to compute these parameters. On one hand, one set of transcriptions was generated by hand, distributed by NIST and used in the evaluations. On the other hand, another set of reference transcriptions was generated automatically via forced alignment of the reference speech-to-text transcriptions to the IHM channels. The forced alignments are the ones used in the experiments in this thesis. For a more detailed description of the differences and the motivation behind the forced-aligned transcriptions, refer to Chapter 6.