Metric-Based Segmentation

Metric-based segmentation is probably the most widely used technique to date. It relies on computing a distance between two acoustic segments to determine whether they belong to the same speaker or to different speakers, and therefore whether a speaker change point exists at the point being analyzed. The two segments are usually adjacent (overlapping or not), and the change point considered lies between them. Most of the distances used for acoustic change detection can also be applied to speaker clustering, to assess how likely it is that two speaker clusters belong to the same speaker.

Let us consider two audio segments ($ i$,$ j$) of parameterized acoustic vectors $ \mathcal{X}_{i}$ and $ \mathcal{X}_{j}$, of lengths $ N_{i}$ and $ N_{j}$ respectively, with means and standard deviations $ \mu_{i},\sigma_{i}$ and $ \mu_{j},\sigma_{j}$. Each segment is modeled using a Gaussian process, $ M_{i}(\mu_{i},\sigma_{i})$ and $ M_{j}(\mu_{j},\sigma_{j})$, which can be a single Gaussian or a Gaussian Mixture Model (GMM). Let us also consider the union of both segments, $ \mathcal{X}$, with mean and standard deviation $ \mu,\sigma$ and the corresponding Gaussian process $ M(\mu,\sigma)$.

In general, two kinds of distances can be defined between any pair of such audio segments. The first kind compares the sufficient statistics of the two sets of acoustic data without considering any particular model applied to the data; these will be called statistics-based distances from now on. They are normally very quick to compute and perform well if $ N_{i}$ and $ N_{j}$ are large enough to estimate the data statistics robustly and the data can be well described by a single Gaussian.
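
As a minimal sketch of a statistics-based distance, the Python code below computes a symmetric Kullback-Leibler divergence between two segments, assuming each is described by a single diagonal-covariance Gaussian. The choice of this particular distance, the function names (`segment_stats`, `symmetric_kl`) and the variance floor are assumptions made for illustration, not taken from the text above.

```python
from statistics import fmean, pvariance


def segment_stats(frames):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    dims = list(zip(*frames))  # transpose: one tuple of values per dimension
    means = [fmean(d) for d in dims]
    variances = [pvariance(d) for d in dims]
    return means, variances


def symmetric_kl(frames_i, frames_j, floor=1e-8):
    """Symmetric KL divergence between two diagonal Gaussians.

    Only the sufficient statistics (means and variances) of each segment
    are used; no model is trained or evaluated on the frames themselves.
    `floor` is an assumed safeguard against zero variances.
    """
    mu_i, var_i = segment_stats(frames_i)
    mu_j, var_j = segment_stats(frames_j)
    total = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        vi, vj = max(vi, floor), max(vj, floor)
        d2 = (mi - mj) ** 2
        # KL(i||j) + KL(j||i) for one dimension; the log terms cancel.
        total += (vi + d2) / (2 * vj) + (vj + d2) / (2 * vi) - 1.0
    return total
```

The distance is zero when both segments have identical statistics and grows as their means or variances diverge. With short segments the variance estimates become unreliable, which is the robustness caveat on $ N_{i}$ and $ N_{j}$ mentioned above.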

The second kind of distance is based on evaluating the likelihood of the data according to models representing it. These distances are slower to compute (as models need to be trained and evaluated) but can achieve better results than the statistics-based ones, since bigger models can be used to fit more complex data. They will be referred to as likelihood-based techniques. The following metrics of either kind have been found of interest in the literature:
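
As one illustration of the likelihood-based kind, the sketch below computes a generalized likelihood ratio between two segments. It assumes the simplest possible models (one diagonal-covariance Gaussian each for $ M_{i}$, $ M_{j}$ and the pooled model $ M$); the function names and the variance floor are hypothetical, and a GMM could be substituted for `train_gaussian` and `log_likelihood` without changing the structure.

```python
import math
from statistics import fmean, pvariance


def train_gaussian(frames, floor=1e-8):
    """Fit a diagonal Gaussian: one (mean, variance) pair per dimension."""
    dims = list(zip(*frames))
    return [(fmean(d), max(pvariance(d), floor)) for d in dims]


def log_likelihood(frames, model):
    """Total log-likelihood of all frames under a diagonal Gaussian model."""
    total = 0.0
    for frame in frames:
        for value, (mu, var) in zip(frame, model):
            total += -0.5 * (math.log(2 * math.pi * var) + (value - mu) ** 2 / var)
    return total


def glr_distance(frames_i, frames_j):
    """Log-likelihood gained by modeling the two segments separately.

    Large values suggest the segments are better explained by two
    models (different speakers) than by a single pooled model.
    """
    pooled = list(frames_i) + list(frames_j)
    m_i = train_gaussian(frames_i)
    m_j = train_gaussian(frames_j)
    m = train_gaussian(pooled)
    return (log_likelihood(frames_i, m_i)
            + log_likelihood(frames_j, m_j)
            - log_likelihood(pooled, m))
```

Unlike a statistics-based distance, every frame is scored against a trained model here, which is where the extra computational cost comes from.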

All of these metric-based techniques compute a function whose maxima or minima need to be compared with a threshold in order to decide whether each candidate change point is accepted. In many cases the threshold is set empirically on a development set, according to a desired performance. Such a procedure leads to a threshold that normally depends on the data being processed and needs to be redefined every time data of a different nature is processed. This problem has been studied within the speaker identification community in order to classify speakers in an open-set speaker identification task (see for example Campbell (1997)). In the area of speaker segmentation and clustering, some publications propose automatic ways to define appropriate thresholds, for example:
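
Independently of how the threshold is obtained, the basic decision step described in this paragraph can be sketched as follows. This is only an illustrative sketch: it assumes a distance curve in which larger values indicate a more likely change, and the function name `detect_change_points` and the fixed, hand-set threshold are hypothetical.

```python
def detect_change_points(distances, threshold):
    """Return indices of local maxima of the distance curve above the threshold.

    `distances[t]` is assumed to be the distance between the two segments
    on either side of candidate change point t.
    """
    changes = []
    for t in range(1, len(distances) - 1):
        is_peak = distances[t - 1] < distances[t] >= distances[t + 1]
        if is_peak and distances[t] > threshold:
            changes.append(t)
    return changes
```

The same curve yields different segmentations for different thresholds, which illustrates why a value tuned on one development set may not carry over to data of a different nature.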
