In the broadcast news domain, Wooters et al. (2004) showed that speaker diarization performance can be improved by using a speech/non-speech detector as a first step in the agglomerative clustering process. The speech/non-speech system used in that work was based on acoustic models that needed to be trained on data as similar as possible to the test data. This poses a robustness problem when the diarization system is applied to ``unseen'' data, and hinders porting the system to new environments, where new training data must be labelled or located and new speech/non-speech models trained. For this reason an alternative was sought that does not require any training.
Among the systems that do not use acoustic models for speech/non-speech detection, the most widely used, when non-speech consists mainly of silence and noise, include energy as a feature. The performance of such systems depends on setting appropriate thresholds, which are typically tuned on development data. Tests with the energy-based decoder that forms part of the hybrid system presented below showed that the optimum threshold depends on the meeting acoustics, and therefore would have to be retuned whenever data from a different source is processed, falling into the same trap as model-based detectors.
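As an illustration of the energy thresholding on which such detectors rely, the following is a minimal sketch, not the decoder described in this thesis: it assumes non-overlapping 10 ms frames at 16 kHz, and all function names, the frame length, and the threshold value are illustrative assumptions.

```python
import math

def frame_log_energies(samples, frame_len=160):
    """Log energy of each non-overlapping frame (160 samples = 10 ms at 16 kHz)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        e = sum(x * x for x in frame) / frame_len
        energies.append(math.log(e + 1e-12))  # small floor avoids log(0)
    return energies

def energy_vad(energies, threshold):
    """Label a frame as speech (True) when its log energy exceeds the threshold."""
    return [e > threshold for e in energies]
```

The sensitivity discussed above shows up directly here: the chosen `threshold` value determines every frame label, so a value tuned on one meeting room may misclassify quiet speech or loud noise in another.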
A novel system to perform speech/non-speech detection was designed and is presented in this thesis, together with its application to speaker diarization in the meetings environment. The system takes advantage of the fact that most non-speech in meetings is silence. It first performs an energy-based detection of the silence portions in the input data using energy derivative filtering based on Li et al. (2002). This stage only needs a coarse threshold setting, which is then iteratively modified until a reasonable number of silence segments is hypothesized. The second stage models speech and silence with GMMs trained on the output of the first stage, and creates the final speech/non-speech segmentation used in the diarization system. By running this two-stage system, no external training data is needed to obtain an initial set of acoustic models. Starting from these initial models, several iterations alternate between segmenting the data and retraining the models to obtain the final speech/non-speech segmentation.
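The first-stage idea of relaxing a coarse threshold until a plausible number of silence segments is hypothesized can be sketched as follows. This is a simplified illustration only: the step size, the target range, and the helper names are assumptions, and the actual system additionally applies energy derivative filtering.

```python
def count_silence_segments(energies, threshold):
    """Count runs of consecutive frames whose log energy is below the threshold."""
    segments, in_silence = 0, False
    for e in energies:
        if e < threshold:
            if not in_silence:      # a new silence run starts here
                segments += 1
                in_silence = True
        else:
            in_silence = False
    return segments

def search_threshold(energies, init_threshold, target_min, target_max,
                     step=0.5, max_iters=100):
    """Iteratively adjust a coarse threshold until the silence-segment
    count falls inside a plausible range."""
    threshold = init_threshold
    for _ in range(max_iters):
        n = count_silence_segments(energies, threshold)
        if n < target_min:          # too few silences: raise the threshold
            threshold += step
        elif n > target_max:        # too many silences: lower the threshold
            threshold -= step
        else:
            break
    return threshold
```

The point of the loop is that only the target range needs to be sensible; the initial threshold itself can be set very coarsely, which is what frees the system from per-source tuning.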
The hybrid system introduced here therefore attempts to solve some of the problems of both model-based and energy-based speech/non-speech detectors. On the one hand, accurate tuning of the energy threshold is avoided by iteratively searching for a rough speech/non-speech segmentation that initializes the model-based decoder. On the other hand, this initialization avoids having to train the decoder's models on pre-labelled data, resulting in a system that is free of the need for training data.
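The second-stage alternation between segmenting and retraining can be sketched as below. This is a heavily simplified illustration, not the thesis system: it assumes one-dimensional log-energy features and a single Gaussian per class in place of GMMs, and all names, the variance floor, and the iteration count are illustrative assumptions.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance of one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-2)            # floor the variance for stability

def log_likelihood(x, mean, var):
    """Log density of x under a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def resegment(features, labels, n_iters=5):
    """Alternate between retraining the class models on the current
    segmentation and relabelling every frame by likelihood."""
    for _ in range(n_iters):
        speech = [f for f, l in zip(features, labels) if l]
        silence = [f for f, l in zip(features, labels) if not l]
        if not speech or not silence:      # degenerate split: stop early
            break
        sp = fit_gaussian(speech)
        si = fit_gaussian(silence)
        labels = [log_likelihood(f, *sp) > log_likelihood(f, *si)
                  for f in features]
    return labels
```

Even when the initial energy-based segmentation mislabels some frames, a few retrain/relabel iterations can pull them back to the correct class, which is why the rough first-stage output is sufficient as initialization.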
In the following sections, both the energy-based decoder and the model-based decoder used in the hybrid system are described. Finally, their combination into the hybrid decoder is explained.