The speaker segmentation and clustering block of the overall speaker diarization system contains two main modules: the speech/non-speech detector and the single-channel speaker diarization system. The speech/non-speech detector differs from the one used for broadcast news in that it does not require any training data for its acoustic models. It is a hybrid energy-based/model-based system built on the assumption that most of the non-speech to be detected in a meeting, which can harm diarization, is silence. It is further described in section 4.1.
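As a rough illustration of the energy-based part of such a detector (not the system's actual implementation; the frame sizes and threshold below are arbitrary choices for the sketch), frames whose log energy falls far below the loudest frame in the recording can be labeled as silence:

```python
import numpy as np

def energy_vad(samples, sample_rate, frame_ms=30, hop_ms=10, threshold_db=-35.0):
    """Minimal energy-based speech/silence labeling (illustrative sketch).

    Frames the signal, computes per-frame log energy, and labels frames
    whose energy lies more than `threshold_db` below the loudest frame
    as silence (non-speech).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(samples) - frame_len) // hop_len)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = samples[i * hop_len : i * hop_len + frame_len].astype(float)
        energies[i] = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
    # Threshold is relative to the loudest frame in the recording.
    return energies > (energies.max() + threshold_db)
```

A model-based refinement, as in the hybrid system described above, would then train silence/speech models on the frames selected by this initial energy pass instead of relying on the fixed threshold alone.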
The single-channel speaker diarization module has evolved from the broadcast news system by adding new algorithms and improving existing ones. Figure 3.6 shows a block diagram of the diarization process, with the newly proposed algorithms and the changes to the baseline system shown in darker boxes. Other improvements in various steps of the algorithms are not reflected in the figure: the modification of duration modeling within the models and the new model initialization algorithm. The following sections describe each of these new modules and algorithm improvements in detail; for those unchanged from the baseline, refer to section 3.1 for complete details.
As mentioned earlier, the meetings speaker diarization system uses the TDOA values (when available) as an independent feature stream. These features are (N-1)-dimensional vectors, where N is the number of channels available, computed at the same rate as the MFCC parameters for synchronous operation. They are reused without any further processing, apart from a conversion to HTK format so that the system can read them.
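The HTK conversion step amounts to writing the vectors into HTK's binary parameter-file layout: a 12-byte big-endian header (number of frames, frame period in 100 ns units, bytes per frame, parameter kind) followed by the float32 feature values. A minimal sketch of such a writer, with the USER parameter kind for externally computed features like TDOA:

```python
import struct

def write_htk(path, features, frame_period_s=0.01):
    """Write a feature matrix to an HTK parameter file (sketch).

    `features` is a list of equal-length float vectors, e.g. the (N-1)
    TDOA values of each frame. Header fields follow the HTK convention:
    nSamples (int32), sampPeriod (int32, 100 ns units), sampSize
    (int16, bytes per frame), parmKind (int16); all big-endian.
    """
    n_frames = len(features)
    vec_dim = len(features[0])
    samp_period = int(frame_period_s * 1e7)  # seconds -> 100 ns units
    samp_size = vec_dim * 4                  # 4 bytes per float32 value
    parm_kind = 9                            # HTK 'USER' parameter kind
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_frames, samp_period, samp_size, parm_kind))
        for vec in features:
            f.write(struct.pack(">%df" % vec_dim, *vec))
```

The 10 ms frame period matches the MFCC computation rate, which is what keeps the two feature streams synchronous.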
The acoustic features continue to be Mel Frequency Cepstrum Coefficients (MFCC), but computed with an analysis window of 30ms (instead of 60ms) and a step of 10ms (instead of 20ms). The extra computation caused by doubling the number of feature vectors is justified by an increase in performance.
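To see why halving the step doubles the feature count, the frame arithmetic can be written out (an illustrative calculation, not part of the system's code; the 60-second duration is an arbitrary example):

```python
def num_frames(duration_s, window_ms, step_ms):
    """Number of analysis frames for a signal of a given duration.

    Each frame covers `window_ms` and consecutive frames start
    `step_ms` apart; the last partial window is dropped.
    """
    duration_ms = duration_s * 1000.0
    if duration_ms < window_ms:
        return 0
    return 1 + int((duration_ms - window_ms) // step_ms)

# Previous settings: 60 ms window, 20 ms step.
frames_old = num_frames(60.0, 60, 20)
# Meetings settings: 30 ms window, 10 ms step.
frames_new = num_frames(60.0, 30, 10)
```

For one minute of audio this yields roughly twice as many frames with the new settings, which is the computation increase the text refers to.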