Overall Channels Weighting Factor

The input signal to the filter-and-sum module is typically a 16-bit, 16 kHz signal, and the output processed by the diarization system has the same characteristics. With 16 bits, a single channel can represent values from -32768 to +32767 in steps of 1 (the resolution of the input). This resolution changes when the weighted sum of the signals is computed, as the step size becomes smaller than 1 (the step size of the summed signal depends on the weights of the individual signals; it would be $\frac{1}{N}$ for equal weighting). Although a finer resolution is available after the sum, the signal needs to be quantized to unit steps to fit into the 16-bit output channel, which introduces a quantization error at each frame.
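As a small worked illustration of the paragraph above, the following Python sketch sums one sample per channel with equal weights and then quantizes the result to an integer. The sample values, the choice of $N=4$ channels and the helper name `equal_weight_sum` are hypothetical and only serve to show that the exact sum lies on a grid of step $\frac{1}{N}$.

```python
from fractions import Fraction


def equal_weight_sum(samples):
    """Exact equal-weight (1/N) sum of one integer sample per channel."""
    n = len(samples)
    return sum(Fraction(s, n) for s in samples)


# Hypothetical 16-bit samples from N = 4 channels at the same time instant.
samples = [1201, 1198, 1203, 1199]

exact = equal_weight_sum(samples)   # 4801/4, a multiple of 1/N
quantized = round(exact)            # nearest integer, as needed for a 16-bit output
error = exact - quantized           # quantization error introduced by the output format

print(exact, quantized, error)
```

Exact fractions are used here only to make the $\frac{1}{N}$ step visible; an actual implementation would more likely work in floating point.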

Using an output signal with more bits (for example, floating point values) would be inconsistent with the standard signals used in the system and was therefore not considered feasible. Instead, two simple modifications were made in order to minimize the quantization error whenever possible. These are:

First, the input signals usually do not cover the full dynamic range offered by the 16 bits available (or do so only at a few instants in the meeting). A scaling factor was defined for all signals so that their sum has a dynamic range closer to that of the available output, minimizing the quantization error of the output signal.
There are several alternatives in signal processing for finding the maximum value of a signal in order to normalize it. Some of them are to compute the absolute maximum amplitude over the whole show, the Root Mean Square (RMS) value, or variations involving a histogram of the signal (for example, taking as the maximum the value at 80% of that histogram).
It was observed that the processed signal contains very low energy areas (silence regions) of short duration on average, and very high energy areas (impulsive noises like door slams, or collective laughter or discussions) of even shorter duration. Using the absolute maximum would saturate the normalizing factor at the highest possible value, and using the RMS would bias it according to the amount of silence in the show. A windowed maximum averaging was implemented instead, in blocks of T=10 seconds, so that every block is very likely to contain some speech. The maximum value is found in each block, and these maxima are averaged over the whole recording. This average is used to obtain the overall weighting factor for the signal in terms of the average maximum of each of the channels as

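$\displaystyle W_{all} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \max\{x_{n}[(m-1) T f_{s}], \cdots, x_{n}[m T f_{s} - 1]\}$ (5.8)

where $N$ is the number of channels, $M$ is the number of blocks of $T$ seconds in the recording, $x_{n}$ is the signal of channel $n$ and $f_{s}$ is the sampling frequency.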
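The windowed maximum averaging described above can be sketched in Python as follows. This is a minimal illustration and not the original implementation: the function names are hypothetical, each channel is assumed to be a plain sequence of samples, and a shorter trailing block is assumed to be kept, which the text does not specify.

```python
def average_block_maximum(channel, fs=16000, block_seconds=10.0):
    """Average of the per-block maxima of one channel (inner sum of Eq. 5.8)."""
    block_len = int(block_seconds * fs)  # samples per block, T * f_s
    if block_len <= 0:
        raise ValueError("block length must be at least one sample")
    # Assumption: a shorter trailing block is kept rather than discarded.
    maxima = [max(channel[start:start + block_len])
              for start in range(0, len(channel), block_len)]
    if not maxima:
        raise ValueError("channel is empty")
    return sum(maxima) / len(maxima)


def overall_average_maximum(channels, fs=16000, block_seconds=10.0):
    """Average over all channels of the per-channel values (outer sum of Eq. 5.8)."""
    per_channel = [average_block_maximum(ch, fs, block_seconds) for ch in channels]
    return sum(per_channel) / len(per_channel)


# Toy data: 2 channels, fs = 4 Hz and 1-second blocks, so each block has 4 samples.
toy_channels = [
    [1, 5, 2, 0, 3, 3, 9, 1],   # block maxima 5 and 9, average 7.0
    [2, 2, 2, 2, 4, 0, 0, 0],   # block maxima 2 and 4, average 3.0
]
w_all = overall_average_maximum(toy_channels, fs=4, block_seconds=1.0)
print(w_all)
```

A single loud event such as a door slam affects only the maximum of the block in which it occurs, so its influence on the average shrinks as the number of blocks grows.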
Second, the output signal must be quantized to convert it from a floating point value (obtained from the sum of all the delayed and weighted signals) to a 16-bit signal. It is rounded to the closest integer value within the range $\pm$ 32767, which bounds the quantization error at $\pm 0.5$, instead of being truncated with the standard ``int'' cast or the ``floor'' function in C, for which the error can approach 1.
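The difference between rounding to the nearest integer and truncating can be illustrated with the short Python sketch below. The helper name `quantize_nearest` and the tie-breaking rule (halves rounded upwards) are assumptions made for the example; Python's built-in `int` truncates toward zero like the C cast, and `math.floor` behaves like the C `floor` function.

```python
import math

INT16_LIMIT = 32767  # symmetric output range, +/-32767, as stated in the text


def quantize_nearest(value):
    """Round to the nearest integer (assumed: ties round up), then clip to +/-32767."""
    rounded = math.floor(value + 0.5)
    return max(-INT16_LIMIT, min(INT16_LIMIT, rounded))


for value in (1200.75, -3.75, 40000.2):
    nearest = quantize_nearest(value)
    truncated = int(value)        # truncation toward zero, like a C (int) cast
    floored = math.floor(value)   # like floor() in C
    print(value, nearest, truncated, floored)
```

For 1200.75, rounding gives 1201 with an error of 0.25, whereas both truncation and `floor` give 1200 with an error of 0.75. The last value shows clipping, where the error is no longer bounded by 0.5 because the input lies outside the representable range.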

user 2008-12-08