Average Total Speaking Time

Apart from looking at the acoustic signal, the reference transcriptions were also analyzed. The first parameter computed both on the meetings and broadcast news is the speaking time per speaker in each of the shows. This is important as it is necessary to create models that can train optimally to the data, and therefore need to be adjusted if the amount of data per speaker changes across domains. Tables 3.8, 3.9 and 3.10 show the number of speakers, average time per speaker and maximum and minimum speaking times.

Table 3.8: Average total speaker duration in RT06s conference room data

Show_ID	#	ave. time	max. time	min. time	ave. time	max. time	min. time
	spkr.	manual	manual	manual	FA	FA	FA
CMU_20050912-0900	4	373.01	620.49	169.77	283.71	487.14	105.96
CMU_20050914-0900	4	368.04	544.88	131.66	277.61	444.69	102.46
EDI_20050216-1051	4	301.65	432.01	184.03	224.01	306.88	111.21
EDI_20050218-0900	4	318.01	489.42	210.46	238.46	361.15	162.5
NIST_20051024-0930	9	182.45	503.49	32.83	118.46	384.29	1.15
NIST_20051102-1323	8	179.25	336.05	46.04	121.47	257.0	2.76
VT_20050623-1400	5	258.16	438.16	155.11	195.60	342.06	118.84
VT_20051027-1400	4	244.27	581.27	103.53	184.21	457.39	70.19

Table 3.9: Average total speaker duration in RT06s lecture room data

Show_ID	# speakers	Ave. time	max. time	min. time
AIT_20051010/Segment1	4	52.68	202.06	2.00
AIT_20051011_B/Segment1	4	65.01	233.91	0.66
AIT_20051011_C/Segment1	4	57.94	160.63	5.67
AIT_20051011_D/Segment1	5	52.80	228.07	0.48
IBM_20050819/Segment1	4	63.07	156.09	28.83
IBM_20050822/Segment1	2	127.47	253.48	1.46
IBM_20050823/Segment1	4	70.86	172.02	30.65
IBM_20050830/Segment1	1	251.27	-	-
ITC_20050503/Segment1	4	66.81	234.06	6.16
ITC_20050607/Segment1	4	64.64	207.49	2.82
UKA_20050420_A/Segment1	3	80.53	228.30	3.69
UKA_20050420_A/Segment2	2	121	178.31	63.68
UKA_20050427_B/Segment1	1	157.51	-	-
UKA_20050504_A/Segment1	1	219.16	-	-
UKA_20050504_B/Segment1	1	253.86	-	-
UKA_20050504_B/Segment2	3	86.74	222.55	6.13
UKA_20050525_A/Segment1	1	240.71	-	-
UKA_20050525_A/Segment2	4	60.09	153.77	3.53
UKA_20050525_B/Segment1	2	114.11	226.36	1.87
UKA_20050525_C/Segment1	2	119.76	238.52	1.00
UKA_20050615_A/Segment1	3	55.28	147.78	6.00
UKA_20050622_B/Segment1	1	211.22	-	-
UKA_20050622_C/Segment1	3	80.38	214.86	8.38
UKA_20050622_C/Segment2	1	230.59	-	-
UPC_20050706/Segment1	5	43.91	182.58	5.96
UPC_20050720/Segment1	5	56.16	255.72	3.17
UPC_20050722/Segment1	5	57.71	234.39	2.88
UPC_20050727/Segment1	5	87.78	139.46	62.73

Table 3.10: Average total speaker duration in RT04f broadcast news data

Show_ID	# speakers	Ave. time	max. time	min. time
20031202_050216_CNN_ENG	15	65.92	462.31	5.7
20031202_203013_CNBC_ENG	13	69.41	322.5	1.07
20031203_183814_ABC_ENG	28	38.67	196.86	2.06
20031204_130035_CNN_ENG	13	99.79	479.12	13.73
20031206_163852_CSPAN_ENG	4	345.62	941.21	7.76
20031209_193152_ABC_ENG	30	33.84	292.49	1.26
20031209_193946_PBS_ENG	15	86.52	349.40	0.47
20031215_204057_CNNHL_ENG	10	110.62	312.53	8.15
20031215_231058_WBN_ENG	29	39.52	536.04	0.44
20031217_184122_ABC_ENG	25	42.56	181.38	1.87
20031218_004126_PBS_ENG	27	47.52	353.56	0.4
20031219_202502_CNBC_ENG	25	38.48	351.59	1.10

In all cases the average values vary greatly even within the same domain. For example, the CSPAN show in broadcast news contains an average speaker length several orders of magnitude higher than any of the other shows. This is due to the rather common total show lengths imposed by NIST for the evaluations but the variability in the number of speakers existent in each recording. from these results it is clear that an automatic way of selecting the speaker models complexity is necessary in order to be able to model each of the possibilities correctly, as the more data available from a speaker, the more complex the models need to be to be able to represent the same level of detail for that speaker compared to others.

Another observation is on the minimum and maximum speaking times columns. The maximum speaking time indicates how long has the main speaker in the recording spoken. In both the lecture room and broadcast news recordings this column tends to contain values very much higher than the average speaking time. In lectures it is the case when the excerpt mainly contains the lecturer giving his talk (sometimes filling the entire excepts and sometimes with a small question and answers section). In the broadcast news shows it is usual when the show contains an anchor speaker that directs the flow of the program. In the conference room meetings the NIST shows also tend to have a dominant speaker.

The minimum speaking time column indicates the length of time that the speaker with less interventions speaks. In many of the lecture room meetings this is nonexistent as the lecturer speaks for the whole time. In the other cases, many of the recordings in lectures and broadcast news, and the NIST recordings in conference room, contain very short durations. These speakers are difficult to model as not much data is available and could create many problems and errors when comparing their models with the longer speaking ones. This is why it is sometimes desirable to talk of agglomerative clustering systems (like the one presented in this thesis) as having the goal of obtaining the optimum number of final clusters, instead of the exact number of existing speakers. Although detecting these short speakers and labelling them as independent clusters is always desirable, it can normally lead to other errors and therefore should be considered as a secondary priority.

In Table 3.8 two different transcriptions were used to compute these parameters. On one hand, one set of transcriptions were generated by hand, distributed by NIST and used in the evaluations. On the other hand, another set of reference transcriptions were generated automatically via forced-alignment of the reference speech-to-text transcriptions to the IHM channels. The forced-alignments are the ones used in this thesis in the experiments section. For a more detailed description on the differences and motivation behind the forced-aligned transcriptions refer to the experiments chapter 6.