One outstanding characteristic of the meetings domain is that multiple microphones are usually available for processing. The time differences between microphones can be used as a feature to identify the speakers in a room by their location, since the speech uttered by each speaker takes a different time to reach each microphone depending on the speaker's position in the room. This feature has two main drawbacks compared with acoustic features. First, it is prone to errors when speakers are located symmetrically with respect to the microphones. Second, it becomes less tractable when two speakers move inside the room, which then requires tracking algorithms.
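As an illustration of how a time difference between two microphones might be measured, the following is a minimal sketch that estimates the delay from the peak of a phase-weighted cross-correlation (GCC-PHAT). The text above does not name an estimation method, so the choice of GCC-PHAT, the function name `estimate_delay` and the parameter `max_lag` are assumptions made only for this example.

```python
import numpy as np


def estimate_delay(sig, ref, max_lag):
    """Estimate by how many samples `sig` lags behind `ref` (GCC-PHAT sketch).

    A positive result means the sound reached the `sig` microphone later.
    `max_lag` (hypothetical parameter, must be >= 1) bounds the search range.
    """
    n = len(sig) + len(ref)              # zero-pad to avoid circular wrap-around
    sig_spec = np.fft.rfft(sig, n=n)
    ref_spec = np.fft.rfft(ref, n=n)
    cross = sig_spec * np.conj(ref_spec)
    cross /= np.abs(cross) + 1e-12       # PHAT weighting: keep only the phase
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that index 0 corresponds to lag -max_lag
    # and the last index corresponds to lag +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag
```

Dividing the lag in samples by the sampling rate gives the delay in seconds. With more than two microphones, one delay per microphone pair (or per microphone against a reference channel) would form the feature vector for a given analysis window.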
For the task of speaker segmentation, Lathoud, McCowan and Odobez (2004) propose a speaker tracking approach that uses only between-channel differences. Lathoud, Odobez and McCowan (2004) extend this approach to speaker clustering and propose algorithms for the detection of concurrent events. Ellis and Liu (2004) and Pardo et al. (2006a) also use only delays for clustering.
According to the literature, the delays between channels alone do not outperform the acoustic features, although Ajmera, Lathoud and McCowan (2004) show that combining delays and MFCC parameters can improve clustering. Pardo et al. (2006b) reach the same conclusion and further improve results by using a weighted combination of the delay and MFCC likelihoods.
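One way such a weighted combination could look is sketched below, assuming the two feature streams are modelled separately and their log-likelihoods are mixed linearly with weights that sum to one. The exact formulation used in the cited work is not given here, so the function names, the weight value `0.9` and the example scores are all hypothetical.

```python
def combined_log_likelihood(loglik_mfcc, loglik_delay, w_mfcc=0.9):
    """Linearly mix the log-likelihoods of the two feature streams.

    `w_mfcc` is a hypothetical stream weight in [0, 1]; the delay stream
    receives the remaining weight 1 - w_mfcc.
    """
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    return w_mfcc * loglik_mfcc + (1.0 - w_mfcc) * loglik_delay


def best_cluster(scores, w_mfcc=0.9):
    """Return the cluster whose combined score is highest.

    `scores` maps a cluster name to a pair
    (MFCC log-likelihood, delay log-likelihood) for the segment under test.
    """
    return max(
        scores,
        key=lambda name: combined_log_likelihood(*scores[name], w_mfcc=w_mfcc),
    )
```

With `w_mfcc=1.0` the decision reduces to the acoustic features alone, and with `w_mfcc=0.0` to the delays alone, so the weight controls how much each stream influences the clustering decision.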
user 2008-12-08