Another parameter that characterizes the different application subdomains is the number of speakers expected to be clustered. Given that the initial number of clusters needs to be higher than the optimum number of clusters, it is important to define an upper bound on the number of speakers so that systems are guaranteed to be able to reach the optimum point. Although an optimal speaker diarization system using hierarchical agglomerative clustering should be able to start at a very high number of clusters and work its way down, in practice the correct estimation of an appropriate upper limit for the number of clusters makes a difference in the resulting performance. This is explained in more detail in section 4.2.2.
The average number of speakers and their minimum and maximum values are shown for the three datasets in table 3.11. One can observe that, in general, the broadcast news shows contain a large number of speakers (averaging 19), although among the shows considered there was one case (CSPAN) with only 4 speakers. This results in a large variation (standard deviation) between the values. The system processing the broadcast news data needs to ensure good performance both when many speakers are present (each with a shorter speaking time) and when fewer are present. Without any algorithm for automatically detecting the initial number of speakers, the system starts at 40 clusters.
In the case of meetings, the lecture room data contains many recordings where only the lecturer speaks and others with several people, up to a maximum of four speakers. The standard deviation is therefore smaller than for broadcast news. In conference rooms the number of speakers ranges between 4 and 9, with an average of 5.25 speakers present. Without automatic detection, the systems start at 10 or 16 clusters for meetings and at 5 or 10 for lecture room data.
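The requirement that the initial number of clusters be higher than the number of speakers can be checked against the figures quoted above. The following is a minimal sketch in Python, not part of the systems described: the dictionary keys and the helper `covers_max_speakers` are hypothetical names, and it assumes that "meetings" in the last sentence refers to the conference room data. Only values stated in the text are used; the maximum for broadcast news is not given here, so it is left undefined.

```python
from typing import Optional

# Initial cluster counts quoted in the text, used when no automatic
# detection of the number of speakers is available.
# Keys are hypothetical labels introduced for this illustration.
INITIAL_CLUSTERS = {
    "broadcast_news": (40,),
    "conference_room": (10, 16),  # assumed to be the "meetings" case
    "lecture_room": (5, 10),
}

# Maximum number of speakers stated in the text; None where the text
# does not give a maximum (see table 3.11 for the full figures).
MAX_SPEAKERS = {
    "broadcast_news": None,
    "conference_room": 9,
    "lecture_room": 4,
}


def covers_max_speakers(domain: str) -> Optional[bool]:
    """Return True if every initial cluster count for `domain` is strictly
    higher than the maximum number of speakers, False if some count is
    not, and None if the maximum is unknown."""
    max_speakers = MAX_SPEAKERS[domain]
    if max_speakers is None:
        return None
    return all(k > max_speakers for k in INITIAL_CLUSTERS[domain])


if __name__ == "__main__":
    for domain in INITIAL_CLUSTERS:
        print(domain, INITIAL_CLUSTERS[domain], covers_max_speakers(domain))
```

Under these assumptions, every quoted starting point for the conference room and lecture room data lies above the largest number of speakers observed, so the agglomerative clustering can, in principle, merge its way down to the optimum.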