The term speaker diarization is sometimes confused with speaker clustering. Speaker clustering denotes the techniques and algorithms that group together all segments belonging to the same speaker. It does not specify whether such segments come from the same acoustic file or from different ones, nor does it say anything about how acoustically homogeneous segments within a single file are obtained. The term speaker diarization refers to systems that perform a speaker segmentation of the input signal followed by a speaker clustering of the resulting segments into homogeneous groups (or some hybrid mechanism that does both at the same time), all within the same file or input stream.
In the literature one normally finds two main applications for speaker diarization. On the one hand, Automatic Speech Recognition (ASR) systems use the speaker-homogeneous clusters to adapt the acoustic models to each speaker and thereby increase recognition performance. On the other hand, speaker indexing and rich transcription systems use the speaker diarization output as one of (possibly) many pieces of information extracted from a recording, enabling its automatic indexing and further processing.
This section reviews the main systems in the literature for both applications. It focuses on systems that address the blind speaker diarization problem, where nothing is known a priori about the number of speakers or their identities. For systems oriented towards rich transcription it is crucial to estimate accurately the number of speakers present, as error measures penalize any incorrectly assigned speaker segment. In ASR systems, by contrast, it is more important to have sufficient data to adapt the resulting speaker models accurately, so several speakers with similar acoustic characteristics are preferably grouped together.
At a high level one can distinguish between online and offline systems. Offline systems have access to the whole recording before they start processing it; these are the most common in the literature and the main focus of this review. Online systems only have access to the data recorded up to the current point. They may allow some latency in the output so that a certain amount of data becomes available for processing, but in any case no information on the complete recording is available. Such systems usually start with a single speaker (whoever talks first in the recording) and iteratively increase the number of speakers as new ones intervene. The following are some representative systems used for online processing:
In Mori and Nakagawa (2001) a clustering algorithm based on the Vector Quantization (VQ) distortion measure (Nakagawa and Suzuki, 1993) is proposed. Processing starts with a single speaker in the codebook, and a new speaker is added whenever the VQ distortion of the incoming data with respect to the current codebooks exceeds a threshold.
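The threshold-based assign-or-create logic of such VQ systems can be sketched as follows. This is a simplified illustration, not the authors' implementation: Euclidean per-speaker codebooks built with a tiny k-means, and a fixed (rather than tuned) distortion threshold.

```python
import numpy as np

def vq_distortion(segment, codebook):
    """Mean distance from each frame to its nearest codeword."""
    # segment: (n_frames, dim); codebook: (n_codewords, dim)
    d = np.linalg.norm(segment[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def make_codebook(segment, n_codewords=4, n_iter=10, seed=0):
    """Tiny k-means to build a per-speaker codebook from one segment."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(segment), size=min(n_codewords, len(segment)), replace=False)
    codebook = segment[idx].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(segment[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for k in range(len(codebook)):
            if np.any(assign == k):
                codebook[k] = segment[assign == k].mean(axis=0)
    return codebook

def online_vq_clustering(segments, threshold):
    """Assign each incoming segment to an existing speaker, or open a new
    speaker when the best VQ distortion exceeds the threshold."""
    codebooks, labels = [], []
    for seg in segments:
        if codebooks:
            dists = [vq_distortion(seg, cb) for cb in codebooks]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:
                labels.append(best)
                continue
        codebooks.append(make_codebook(seg))
        labels.append(len(codebooks) - 1)
    return labels
```

Note that, as in any online scheme, each decision is final: a segment assigned to the wrong speaker cannot be reassigned once later data arrives.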
In Rougui et al. (2006) a GMM-based system is proposed, using a modified Kullback-Leibler (KL) distance between models. Change points are detected as the speech becomes available, and each segment is either assigned to a speaker already present in the database or used to create a new speaker, according to a dynamic threshold. Emphasis is placed on fast classification of speech segments into speakers by organizing the speaker models in a decision tree.
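The model-distance variant of the same assign-or-create loop can be sketched as follows. This is a simplification, not the paper's system: single diagonal-covariance Gaussians stand in for the GMMs, a symmetric KL divergence (which has a closed form for Gaussians) stands in for the modified KL distance, the threshold is fixed rather than dynamic, and the decision-tree speedup is omitted.

```python
import numpy as np

def kl_diag_gauss(m1, v1, m2, v2):
    """KL(N1 || N2) for diagonal-covariance Gaussians (means m, variances v)."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def sym_kl(m1, v1, m2, v2):
    """Symmetrized KL divergence, usable as a distance between models."""
    return kl_diag_gauss(m1, v1, m2, v2) + kl_diag_gauss(m2, v2, m1, v1)

def online_gaussian_clustering(segments, threshold):
    """Fit one Gaussian per segment; attach it to the nearest speaker model,
    or open a new speaker when the distance exceeds the threshold."""
    models, labels = [], []
    for seg in segments:
        m, v = seg.mean(axis=0), seg.var(axis=0) + 1e-6  # floor the variance
        if models:
            dists = [sym_kl(m, v, mm, vv) for mm, vv in models]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:
                labels.append(best)
                continue
        models.append((m, v))
        labels.append(len(models) - 1)
    return labels
```

With full GMMs the KL divergence has no closed form, which is one motivation for the modified distances and tree-structured model organization used in the actual system.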
All systems presented below are based on offline processing, although some of the techniques could also be used in an online implementation. These systems can be classified into two main groups. Hierarchical clustering techniques reach the final diarization by iteratively merging or splitting existing clusters, evaluating different numbers of possible clusters along the way. Other clustering techniques first estimate the number of clusters and then obtain a diarization output directly, without deriving the clusters from bigger or smaller ones.
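The bottom-up (agglomerative) variant of hierarchical clustering can be sketched as follows. This is a minimal illustration: Euclidean distance between cluster mean vectors stands in for the model-based metrics (e.g. BIC or GLR) that actual systems use, and a fixed distance threshold acts as the stopping criterion that implicitly selects the number of speakers.

```python
import numpy as np

def agglomerative_diarization(seg_features, stop_threshold):
    """Bottom-up clustering: start with one cluster per segment and merge the
    closest pair until the smallest inter-cluster distance exceeds the
    stopping threshold. Returns lists of segment indices, one per cluster."""
    clusters = [[i] for i in range(len(seg_features))]
    feats = [f.copy() for f in seg_features]  # pooled frames per cluster

    def dist(a, b):
        # Distance between cluster mean vectors (stand-in for BIC/GLR).
        return np.linalg.norm(feats[a].mean(axis=0) - feats[b].mean(axis=0))

    while len(clusters) > 1:
        pairs = [(dist(i, j), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > stop_threshold:
            break  # no sufficiently close pair left: stop merging
        feats[i] = np.vstack([feats[i], feats[j]])  # pool the merged data
        clusters[i].extend(clusters[j])
        del feats[j], clusters[j]
    return clusters
```

The top-down variant works in the opposite direction, splitting clusters starting from a single one; in both cases the stopping criterion is what determines the estimated number of speakers.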