The first parameter obtained from the input signals is the signal-to-noise ratio (SNR). It was computed using the stnr tool from the NIST Speech Quality Assurance Package (SPQA) (NIST Speech tools and APIs, 2006), which is also used in the acoustic beamforming system evaluated in the experiments section. This program estimates the SNR of a file, defined as the ratio, expressed in decibels, of the peak speech power to the mean noise power. Both quantities are estimated from a histogram of the signal power.
To determine the mean noise power, a raised cosine function is fitted to the peak on the left-hand side of the histogram (lowest values), using a search algorithm that minimizes the chi-square distance between the histogram and the function. The midpoint of this function is taken as the mean noise power. The fitted raised cosine is then subtracted from the histogram to estimate the speech power distribution. The peak speech power is defined as the midpoint of the histogram bin below which 95% of the power falls. Because the speech power contains additive noise, the computed noise power is subtracted from it before it is used in the SNR formula.
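The procedure just described can be illustrated with a minimal Python sketch. It is not the stnr implementation. The function names, the number of bins, the exhaustive grid search, the symmetric form of the chi-square distance, the choice of tying the height of the raised cosine to the bin count at its midpoint, and the reading of "95% of the power" as 95% of the remaining histogram mass are all assumptions made for illustration. The input is assumed to be a list of per-frame power values already expressed in dB.

```python
import math

def raised_cosine(x, mid, half_width, height):
    """Raised-cosine bump: `height` at `mid`, zero beyond mid +/- half_width."""
    t = (x - mid) / half_width
    if abs(t) > 1.0:
        return 0.0
    return 0.5 * height * (1.0 + math.cos(math.pi * t))

def chi_square(hist, model):
    """Symmetric chi-square distance between two histograms (assumed form)."""
    return sum((h - m) ** 2 / (h + m) for h, m in zip(hist, model) if h + m > 0)

def estimate_snr(frame_power_db, n_bins=100):
    """Return (noise_db, peak_speech_db, snr_db) from per-frame power in dB."""
    lo, hi = min(frame_power_db), max(frame_power_db)
    bin_width = (hi - lo) / n_bins or 1.0  # guard against a constant input
    hist = [0.0] * n_bins
    for p in frame_power_db:
        hist[min(int((p - lo) / bin_width), n_bins - 1)] += 1.0
    centers = [lo + (i + 0.5) * bin_width for i in range(n_bins)]

    # Hypothetical exhaustive search over the midpoint and half-width of the
    # bump, restricted to the lower half of the histogram (the noise side).
    best = None
    for i in range(n_bins // 2):
        for k in range(1, 41):
            half_width = 0.5 * k  # 0.5 dB to 20 dB
            model = [raised_cosine(c, centers[i], half_width, hist[i])
                     for c in centers]
            dist = chi_square(hist, model)
            if best is None or dist < best[0]:
                best = (dist, centers[i], model)
    _, noise_db, noise_model = best

    # Subtract the fitted noise bump; the remainder approximates speech.
    residual = [max(h - m, 0.0) for h, m in zip(hist, noise_model)]
    target = 0.95 * sum(residual)
    cumulative = 0.0
    peak_db = centers[-1]
    for c, r in zip(centers, residual):
        cumulative += r
        if cumulative >= target:
            peak_db = c  # bin midpoint below which 95% of the mass falls
            break

    # Remove the additive noise from the speech power in the linear domain.
    noise_lin = 10.0 ** (noise_db / 10.0)
    speech_lin = max(10.0 ** (peak_db / 10.0) - noise_lin, 1e-12)
    snr_db = 10.0 * math.log10(speech_lin / noise_lin)
    return noise_db, peak_db, snr_db
```

On a synthetic signal with a narrow low-power cluster of frames and a broader high-power cluster, the returned noise level should fall close to the centre of the low-power cluster. When the speech level is far above the noise level, the final subtraction changes the result very little.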
As speech and noise do not exist independently in the recorded signal, this method is only an approximation of the SNR. The result from this tool might not be comparable to results from other tools, but according to the authors it is consistent across results obtained with the same tool, and is therefore adequate for comparing the quality of several signals, as is intended in this section. It must be noted that in a few cases this algorithm is known to give erroneous results. Individual values should therefore be treated as indicative only, and averages should be used to avoid misreadings.
To compute the SNR values, both for the meetings and for the broadcast news recordings, only the regions determined to be part of the evaluation were considered. As pointed out before, some of the recordings contain more acoustic data than the evaluated region. This extra material is sometimes excluded because of problems with the microphones (in meetings) or because it contains commercials or very noisy acoustics (in broadcast news).
First, the SNR is computed for the files in the RT04f data set. As can be seen in Table 3.4, the speech peak power remains roughly constant at a high value, with an average of 65 dB, while the noise average power is highly variable, ranging from around 15 dB to around 62 dB. These averages are taken over the SNR values in the log domain and are meant to indicate the overall quality of the dataset.
Some of the shows contain news material in which reporters deliver their reports from the field, with a high level of background noise, while the anchor is in the studio, with a very good quality microphone in a controlled environment. The shows that contain few or no field recordings achieve a very good SNR (around 50 dB), while others perform very poorly (for example the CNN Headline News, ABC and CNBC shows).
In the meetings domain the SNR is computed separately for the conference room and lecture room sets. In the conference room set each room contains a variable number of microphones, mostly separated into two groups: the microphones placed in the middle of the table (labelled MDM) and the head-mounted microphones worn by some of the participants (labelled IHM). Although the speaker diarization system presented in this thesis does not analyze the IHM case, the SNR for these microphones is also computed for comparison purposes.
Tables 3.5 and 3.6 show the average SNR for the MDM and IHM channels in the RT06s conference room meetings. In both tables the number of available microphones is indicated in the second column. The third through fifth columns then indicate the average (in the linear domain) of the SNR values for all channels in each meeting. As the variety of microphones causes them to have very diverse quality levels, the last two columns indicate the maximum and minimum SNR values to give an idea of how dispersed these are. Finally, the averages over all meetings are computed in the log domain, as done in the broadcast news results.
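The two kinds of average mentioned above give different numbers for the same inputs. The following sketch shows the difference, assuming the values are power ratios expressed in dB (so that the linear value is 10^(dB/10)). The function names and the example values are hypothetical and are not taken from the tables.

```python
import math

def mean_log_domain(values_db):
    """Arithmetic mean of the dB values themselves."""
    return sum(values_db) / len(values_db)

def mean_linear_domain(values_db):
    """Convert dB to linear power ratios, average, and convert back to dB."""
    linear = [10.0 ** (v / 10.0) for v in values_db]
    return 10.0 * math.log10(sum(linear) / len(linear))

# Hypothetical two-channel example: one poor and one good microphone.
channels_db = [10.0, 20.0]
print(mean_log_domain(channels_db))               # 15.0
print(round(mean_linear_domain(channels_db), 2))  # 17.4
```

The linear-domain average is dominated by the best channels, whereas the log-domain average weights every value equally on the dB scale.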
The speech peak power is approximately the same in all cases (around 65 dB). The noise level for the MDM channels is much higher than for the broadcast news channels, which causes a decrease in SNR of almost 5 dB. The average noise level is lower for the IHM channels than for the MDM channels or the broadcast news shows, which leads to a better overall SNR. This is due to the proximity of these microphones to the speakers and to the fact that a meeting room contains less noise than some broadcast news shows. It is interesting to point out the outstanding quality of the IHM channels used in the Edinburgh recordings (within the AMI project), although these same meetings have some of the worst-quality MDM microphones.
Overall, the MDM channels in the conference room are of lower quality than the broadcast news average, but their quality is more consistent across meetings.
Finally, Table 3.7 shows the computed averages for the RT06s meetings in the lecture room dataset. As in Table 3.5, several distant microphones are available for each recording. The average across microphones is computed in the linear domain, while the average over all recordings is computed in the log domain.
Although all meeting recordings were made within the CHIL project, the room layout and the acoustic environment differ from one lecture room to another. The speech average peak power varies greatly among the rooms (from around 50 dB in the AIT recordings to around 80 dB in the UKA recordings) while remaining stable within the same lecture room. The same happens with the noise average power, which is lowest for the AIT recordings and highest for the UKA recordings. This indicates that the recording settings were not the same in all rooms, a difference that may be due solely to the amplification applied to the signal by the recording equipment.
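The remark on amplification can be made concrete with a short derivation. As a sketch, assume that the recording chain applies a single linear power gain $g$ to the whole signal, and write $P_s$ for the peak speech power and $P_n$ for the mean noise power. These symbols are introduced here for illustration only. Both levels then shift by the same number of decibels, and the SNR, computed with the noise power subtracted from the speech power as described earlier, is unchanged:

```latex
\begin{align*}
P_s' &= g\,P_s, \qquad P_n' = g\,P_n \\
10\log_{10} P_s' &= 10\log_{10} P_s + 10\log_{10} g, \qquad
10\log_{10} P_n' = 10\log_{10} P_n + 10\log_{10} g \\
\mathrm{SNR}' &= 10\log_{10}\frac{P_s' - P_n'}{P_n'}
             = 10\log_{10}\frac{g\,(P_s - P_n)}{g\,P_n}
             = \mathrm{SNR}
\end{align*}
```

Under this assumption, a large difference in absolute speech and noise levels between rooms does not by itself imply a difference in SNR.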
Regarding the SNR over all the channels, the AIT recordings consistently achieve SNR values in the twenties, while the other recordings are usually in the tens. The global average is 18.47 dB, slightly lower than the average for the meetings in the conference room subdomain. The differences between minimum and maximum SNR values are in line with those in Table 3.5.