Acoustic Features for Speaker Diarization

Speaker diarization falls into the category of speaker-based processing techniques. Features extracted from the acoustic signal are intended to convey information about the speakers in the conversations in order to enable the systems to separate them optimally.

As in speaker recognition and speech recognition systems, the parametrization features commonly used in speaker diarization are Mel Frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC), Perceptual Linear Prediction (PLP), Linear Predictive Coding (LPC) and others.

Although the aforementioned parametrization techniques yield good performance in current speaker diarization and recognition systems, they are usually not focused on representing the information relevant to distinguishing between speakers, nor on isolating such information from other interfering sources (such as non-stationary noise, background music and others). Nevertheless, speaker recognition and diarization systems like the one presented in this thesis use MFCC parameters with a higher number of coefficients, as the higher-order coefficients are known to incorporate speaker information.

This section points out some research that proposes alternative parameters focusing on speaker characteristics and/or on the particular conditions of the tasks they are applied to, all within the speaker-based area. Such parameters can constitute an advantage if used alone or in conjunction with the most common parametrization techniques. Although their use is still not widespread, they are likely only the tip of the iceberg of parameters exploiting speaker information to come.

Yamaguchi et al. (2005) propose a speaker segmentation system using energy, pitch frequency, peak-frequency centroid and peak-frequency bandwidth, and add three new features: temporal feature stability of the power spectra, spectral shape and white noise similarities. All three are related to the cross correlation of the power spectrum of the signal.

In order to avoid the influence of background noises and other non-speaker related events, in Pelecanos and Sridharan (2001) and more recently in Ouellet et al. (2005), feature warping techniques are proposed to change the shape of the p.d.f. (probability density function) of the features to a Gaussian shape prior to their modeling. They have been applied with success in Sinha et al. (2005) and Zhu et al. (2006) for speaker diarization in broadcast news and meetings respectively.
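The warping step described above can be illustrated with a minimal sketch. It assumes a single feature dimension, a sliding window centred on each frame, and a standard normal target distribution. The function name `warp_features`, the window length and the tie-handling rule are illustrative choices, not details taken from the cited papers.

```python
from statistics import NormalDist


def warp_features(values, window=301):
    """Warp a 1-D feature stream towards a standard normal distribution.

    Each value is ranked among the values inside a sliding window centred
    on its frame, and the rank is mapped through the inverse normal CDF.
    The window length (in frames) is an assumed, illustrative default.
    """
    std_normal = NormalDist()  # target p.d.f.: mean 0, variance 1
    half = window // 2
    warped = []
    for i, x in enumerate(values):
        # The window is truncated at the edges of the stream.
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        segment = values[lo:hi]
        # Rank of x inside the window (1 = smallest value).
        rank = sum(1 for v in segment if v < x) + 1
        # Midpoint rule keeps the probability strictly inside (0, 1).
        p = (rank - 0.5) / len(segment)
        warped.append(std_normal.inv_cdf(p))
    return warped
```

In a multi-dimensional front end, a routine like this would presumably be applied to each cepstral coefficient stream independently before modeling.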

In the area of speech activity detection (SAD), several features have also been proposed in recent years. In Kristjansson et al. (2005), some well-known features and other new ones are proposed, based on the autocorrelation of the signal or on characteristics of the spectrum.

In Nguyen (2003), a new theoretical framework for natural isometric frontend parameters based on differential geometry is presented and applied to speaker diarization, improving performance when used in combination with standard MFCC parameters.

Moh et al. (2003), Tsai et al. (2004) and Tsai et al. (2005) propose speaker diarization systems that construct a speaker space from the data and project the feature vectors into it prior to the clustering step. Similarly, Collet et al. (2005) propose the technique of anchor modeling (introduced in Sturim et al. (2001)), where acoustic frames are projected into an anchor model space (previously defined from outside data) and speaker tracking is performed with the resulting parameter vectors. They show that it improves robustness against outside interfering signals and claim that it is domain independent.

When more than one microphone is collecting the recordings (for example in meeting rooms), Pardo et al. (2006a), Pardo et al. (2006b), ICSI Meeting Recorder Project: Channel skew in ICSI-recorded meetings (2006) and Lathoud and McCowan (2003) show that the time delays between microphones are useful for speaker diarization.
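As a rough illustration of how a time delay between two microphone channels can be obtained, the sketch below picks the lag that maximizes a plain time-domain cross-correlation. This is an assumption made for illustration only: the cited works may rely on different estimators (for instance, generalized cross-correlation variants), and the function name `estimate_delay` and its parameters are hypothetical.

```python
def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` is a delayed copy of `ref`.
    Lags from -max_lag to +max_lag are searched exhaustively.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            # Only overlapping samples contribute to the correlation.
            if 0 <= m < len(other):
                score += r * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy example: the second channel is the first one delayed by 3 samples.
mic_a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
mic_b = [0, 0, 0, 0, 0, 1, 2, 1, 0, 0]
delay = estimate_delay(mic_a, mic_b, max_lag=5)
```

Computed over short analysis windows, such per-pair delays would form a feature vector that changes when the active speaker (and hence the source position) changes.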

Finally, Chan et al. (2006) propose the use of vocal source features for the task of speaker segmentation, using a system based on Delacourt and Wellekens (2000). Also, in Lu and Zhang (2002b), a real-time two-step algorithm is proposed that performs a Bayesian fusion of LSP, MFCC and pitch features.
