In meeting room recordings there is normally access to more than one microphone recording the room synchronously, which brings spatial diversity. It is desirable to take advantage of this signal multiplicity by using multichannel algorithms such as acoustic beamforming. In section 2.5, basic microphone array theory and the main acoustic beamforming techniques in the literature were reviewed, together with their use for speech enhancement and the methods previously applied to the meetings environment for this purpose.
In order to use multichannel beamforming techniques in the meetings domain, the set of available microphones must be considered to constitute a microphone array. The characterization and use of this array cannot follow the classical approach, as the locations and characteristics of the available microphones may be non-conventional. The system must therefore be robust and require little prior information, since the microphones can be located anywhere (with varying distances between them) and can differ greatly in quality and characteristics (both in directivity and type). By applying the appropriate techniques, in most cases it is possible to obtain a gain in signal quality and to improve speaker diarization and speech recognition results.
One of the steps necessary to perform acoustic beamforming in this environment, with the selected techniques, is the estimation of the delays between channels using cross-correlation. These delays can also be used as an input to the speaker diarization system to cluster the different speakers in a meeting room by their locations (derived from the delays). Although by themselves the delays do not carry as much information as the acoustic signal, when both are combined in a multi-stream diarization system (presented in this chapter) important gains are observed with respect to using the acoustic features alone.
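To make the delay-estimation step concrete, the following is a minimal sketch of delay estimation between a channel and a reference using GCC-PHAT, a phase-transform-weighted cross-correlation commonly used for this purpose in reverberant rooms. The function name, interface, and parameters are illustrative assumptions, not the actual implementation described in this chapter:

```python
import numpy as np

def gcc_phat(x, ref, max_delay):
    """Estimate the delay (in samples) of x relative to ref,
    searching within +/- max_delay, via GCC-PHAT."""
    n = len(x) + len(ref)
    X = np.fft.rfft(x, n)
    R = np.fft.rfft(ref, n)
    cross = X * np.conj(R)
    # Phase transform: discard magnitude and keep only phase,
    # which sharpens the correlation peak under reverberation.
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n)
    # Reorder so the center index corresponds to zero delay:
    # negative lags come from the tail of the IFFT output.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay
```

In a multichannel meeting recording, this estimate would typically be computed per analysis window against a chosen reference channel, yielding the time-delay-of-arrival stream that both the beamformer and the diarization system can consume.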
First, in section 5.1 the practical issues encountered in a meeting room multichannel layout are presented, and a filter-and-sum algorithm is proposed and described to process the data. Then, in section 5.2, the full acoustic beamforming implementation developed and used for the speaker diarization and Automatic Speech Recognition (ASR) tasks is covered. Finally, in section 5.3, the use of the delays obtained from the speaker location estimation is explained, describing how they improve the acoustic diarization performance when both types of features are combined, and how the weighting between features is automatically computed.
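As a point of reference for the processing outlined above, the following is a minimal sketch of a weighted delay-and-sum combination, the simplest form of filter-and-sum. The function name, the uniform default weights, and the circular-shift alignment are illustrative assumptions, not the actual implementation covered in section 5.2:

```python
import numpy as np

def filter_and_sum(channels, delays, weights=None):
    """Time-align each channel by its estimated delay (in samples),
    apply a per-channel weight, and sum into one enhanced signal.

    channels: list of equal-length 1-D arrays, one per microphone
    delays:   per-channel delay relative to a reference channel
    weights:  per-channel relative weights (uniform if None)
    """
    n_ch = len(channels)
    if weights is None:
        weights = np.full(n_ch, 1.0 / n_ch)
    out = np.zeros_like(channels[0], dtype=float)
    for x, d, w in zip(channels, delays, weights):
        # Shift the channel back by its delay so all channels are
        # aligned on the reference, then accumulate with its weight.
        # (Circular shift is a simplification; a real system would
        # handle segment boundaries explicitly.)
        out += w * np.roll(x, -d)
    return out
```

In the meetings setting the delays would come from the cross-correlation step, and the weights would reflect the relative quality of each (possibly very heterogeneous) microphone rather than being uniform.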