Both the individual channel enhancement block and the acoustic fusion block aim to obtain a signal of better quality than the original in order to improve the performance of the diarization system.
The individual channels are first Wiener-filtered (Wiener, 1949) to improve the SNR with the same algorithm as in the ICSI-SRI-UW Meetings recognition system (Mirghafori et al., 2004), which uses a noise reduction algorithm developed for the Aurora 2 front-end, proposed originally in Adami, Burget, Dupont, Garudadri, Grezl, Hermansky, Jain, Kajarekar, Morgan and Sivadas (2002). The algorithm performs Wiener filtering with typical engineering modifications, such as a noise over-estimation factor, smoothing of the filter response, and spectral flooring. The original algorithm was modified to use a single noise spectral estimate for each meeting waveform, computed over all the frames judged to be non-speech by the voice-activity detection component within the module. As shown in Figure 3.5, the algorithm is applied independently to each meeting channel and uses overlap-add resynthesis to create noise-reduced output waveforms, which are then fed either into the acoustic fusion block (multi-channel) or directly into the segmentation and clustering block (single-channel).
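The single-noise-estimate variant described above can be sketched as follows. This is an illustrative simplification, not the Aurora 2 front-end itself: the frame size, hop, over-estimation factor, floor value, and the energy-percentile stand-in for the module's voice-activity detector are all assumptions.

```python
import numpy as np

def wiener_denoise(x, frame=512, hop=256, over_est=1.2, floor=0.01,
                   nonspeech_mask=None):
    """Sketch of Wiener filtering with a single per-waveform noise
    estimate and overlap-add resynthesis. `nonspeech_mask` (one boolean
    per frame) stands in for the module's voice-activity detector."""
    win = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    spec = np.array([np.fft.rfft(win * x[i * hop:i * hop + frame])
                     for i in range(n_frames)])
    power = np.abs(spec) ** 2
    if nonspeech_mask is None:
        # crude energy-based VAD: lowest-energy frames taken as non-speech
        energy = power.sum(axis=1)
        nonspeech_mask = energy < np.percentile(energy, 20)
    # single noise spectral estimate over all non-speech frames
    noise = power[nonspeech_mask].mean(axis=0)
    # Wiener gain with noise over-estimation and a spectral floor
    snr = np.maximum(power - over_est * noise, 0.0) / (noise + 1e-12)
    gain = np.maximum(snr / (1.0 + snr), floor)
    # overlap-add resynthesis of the attenuated frames
    out = np.zeros(len(x))
    for i, frame_spec in enumerate(spec * gain):
        out[i * hop:i * hop + frame] += np.fft.irfft(frame_spec, n=frame)
    return out
```

In a multi-channel meeting this function would be run once per channel before the fused or single-channel signal is passed on.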
The acoustic fusion module makes use of standard beamforming techniques in order to obtain an ``enhanced'' version of the signal as a combination of the multiple channel input signals. It considers the multiple channels to form a microphone array. Neither the microphone positions nor their acoustic properties are known. Given these constraints, a variation of the simple (yet effective) delay&sum beamforming technique is applied, as it requires no information about the microphones in order to operate. As the different microphones have different acoustic directivity patterns and are located at positions in the room with different noise levels, a dynamic weighting of the individual channels and a triangular filtering are used to reduce their negative effects. Given this channel filtering, the technique will be referred to as filter&sum from now on.
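The core of the filter&sum combination can be sketched as below: each channel is aligned by its estimated delay and summed with a per-channel weight. The integer-sample delays, the weight values, and the use of a circular shift (instead of proper boundary handling) are simplifying assumptions for illustration; the dynamic weight estimation and triangular filtering of the full system are not shown.

```python
import numpy as np

def filter_and_sum(channels, delays, weights):
    """Align each channel by its estimated TDOA (in samples, relative to
    the reference channel) and form a weighted sum.

    channels: list of equal-length 1-D arrays
    delays:   integer sample delays per channel (0 for the reference)
    weights:  per-channel gains, assumed to sum to one
    """
    out = np.zeros(len(channels[0]))
    for x, d, w in zip(channels, delays, weights):
        # shift the channel by -d so it lines up with the reference;
        # np.roll wraps around, acceptable for this sketch
        out += w * np.roll(x, -d)
    return out
```

With equal weights this reduces to plain delay&sum; the dynamic weights would instead reflect each channel's estimated quality at that instant.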
The filter&sum beamforming technique involves estimating the relative time delay of arrival (TDOA) of the acoustic signal with respect to a reference channel. The GCC-PHAT (Generalized Cross Correlation with Phase Transform) is used to find the potential relative delays for each of the speakers in the meeting. In order to prevent impulsive noise, short-term events and overlapped speech from corrupting the TDOA estimate, multiple TDOA values are computed at each time step and a two-stage post-processing algorithm selects the most appropriate value. On one hand, noise is detected by measuring the quality of the computed cross-correlation values at each point with respect to the rest of the meeting, and the computed TDOA values are replaced by the previous (more reliable) values when that quality is considered too low. On the other hand, impulsive events and overlap are dealt with by a double-step Viterbi decoding of the delays, which obtains the set of TDOA values that are both reliable and stable. A more in-depth explanation of these and other steps involved in the acoustic fusion block is given in chapter 5.
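The GCC-PHAT computation underlying the TDOA estimates can be sketched as follows. The function name, the maximum-delay search window, and the returned peak value (usable as the correlation-quality measure mentioned above) are illustrative choices, not the thesis implementation.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_delay_s=0.02):
    """Estimate the TDOA of `sig` relative to `ref` via GCC-PHAT.

    The cross-power spectrum is whitened (phase transform) so that only
    phase information contributes to the correlation peak, which makes
    the estimate robust to reverberation and spectral coloring.
    Returns (delay in seconds, peak correlation value)."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * max_delay_s)
    # gather negative lags (array tail) and positive lags (array head)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs, float(np.max(np.abs(cc)))
```

Computing this between every channel and the reference at each analysis step, and keeping the N best peaks, yields the candidate TDOA values that the two-stage post-processing then filters.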
Apart from steering the filter&sum beamformer, the post-processed delay estimates are also used in the segmentation and clustering block, as they convey information about each speaker through his/her location in the room. Such location cues are complementary to the acoustic features and therefore add useful evidence to the diarization system. In section 5.3 the combination of both feature types is presented in detail.