Reference Channel Computation

A typical implementation of a time-delay-based beamforming system needs one of the channels to be selected as the reference channel. This channel is compared with each of the others, and the time delay of arrival (TDOA) is estimated for each pair. The reference channel should therefore be the best available representative of the acoustics in the meeting, since the delay estimated for every other channel depends on it.

In the meetings transcribed by NIST for the Rich Transcription evaluations (NIST Rich Transcription evaluations, website: http://www.nist.gov/speech/tests/rt, 2006), one microphone is indicated as the most centrally located in the room. This microphone is chosen empirically, given the room layout and prior knowledge of the microphone types. This module overrides that decision and selects a microphone automatically using an acoustic criterion. The aim is robustness in cases where no information at all is available on the room layout or the microphone placement. Two acoustic criteria were investigated for selecting this channel:

A selection based on signal-to-noise ratio (SNR): a simple energy-based speech/non-speech detector is applied to each channel independently and the SNR is computed. The channel with the highest SNR is chosen as the reference channel. The difficulty lies in how accurate the speech/non-speech detection is and how consistent it is across channels. The implementation computed speech/non-speech labels for each channel independently and then computed the SNR of each one, which gave mixed results. Computing the SNR from a combined speech/non-speech technique, in which all channels contribute to a single segmentation, might have improved this selection algorithm.
A selection based on the average cross-correlation between channels: the cross-correlation (GCC-PHAT) is computed for all possible channel pairs over a block of 1 s duration. This is repeated for blocks linearly spaced along the recording. For each channel $i$ the average cross-correlation is computed as:

$\displaystyle \overline{\mbox{cross\_correlation}_{i}} = \frac{1}{MN}\sum_{m=1}^{M} \sum_{j=1, j \neq i}^{N} xcorr(i, j)$ (5.7)

where $N$ is the number of channels, $M$ is the number of blocks used in the average, and $xcorr(i, j)$ is evaluated on each block $m$. The implementation used the GCC-PHAT cross-correlation, as described below.
The channel with the highest average cross-correlation was chosen as the reference channel. This metric takes into account both the total amount of time each speaker talks and the quality of each microphone. If all microphones were identical and all speakers spoke for the same amount of time, the chosen microphone should be the most centrally located one, coinciding with the microphone NIST reports in the RT evaluations.
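
The two criteria above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation described in the text: the function names (`gcc_phat_peak`, `select_by_xcorr`, `select_by_snr`), the 30 ms frame length and the median energy threshold for speech/non-speech are all hypothetical choices. Taking the peak of the GCC-PHAT function as the value of $xcorr(i, j)$ is also an assumption.

```python
import numpy as np


def gcc_phat_peak(x, y):
    """Peak of the GCC-PHAT cross-correlation of two equal-length blocks."""
    n = 2 * len(x)  # zero-pad so the correlation is linear, not circular
    cross = np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n))
    cross /= np.maximum(np.abs(cross), 1e-12)  # PHAT weighting: keep phase only
    return float(np.max(np.fft.irfft(cross, n)))


def select_by_xcorr(channels, fs, num_blocks=10, block_sec=1.0):
    """Pick the channel with the highest average cross-correlation (Eq. 5.7)."""
    channels = np.asarray(channels, dtype=float)
    n_ch, n_samp = channels.shape
    block = int(block_sec * fs)  # assumes the recording is at least one block long
    # Block start positions linearly spaced along the recording.
    starts = np.linspace(0, n_samp - block, num_blocks).astype(int)
    avg = np.zeros(n_ch)
    for s in starts:
        seg = channels[:, s:s + block]
        for i in range(n_ch):
            for j in range(i + 1, n_ch):
                peak = gcc_phat_peak(seg[i], seg[j])  # same peak for (i, j) and (j, i)
                avg[i] += peak
                avg[j] += peak
    avg /= num_blocks * n_ch  # the 1/(MN) factor of Eq. (5.7)
    return int(np.argmax(avg)), avg


def select_by_snr(channels, fs, frame_sec=0.03):
    """Pick the channel with the highest SNR, using a per-channel energy detector."""
    channels = np.asarray(channels, dtype=float)
    frame = int(frame_sec * fs)
    snrs = []
    for ch in channels:
        n_frames = len(ch) // frame
        energy = (ch[:n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
        thresh = np.median(energy)  # assumed threshold: upper half counts as speech
        speech, noise = energy[energy > thresh], energy[energy <= thresh]
        if speech.size == 0 or noise.size == 0:
            snrs.append(0.0)  # degenerate case: no contrast between frames
            continue
        snrs.append(10.0 * np.log10((speech.mean() + 1e-12) / (noise.mean() + 1e-12)))
    return int(np.argmax(snrs)), np.array(snrs)
```

Both functions take an array of shape (channels, samples) and return the index of the selected channel together with the per-channel scores, so the two criteria can be compared on the same recording. The SNR sketch segments each channel independently, so it shares the weakness noted above: the segmentations need not agree across channels.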

user 2008-12-08