Acoustic beamforming of the multiple input signals has several advantages. It keeps the subsequent speaker diarization system simple: that system can be reused from the broadcast news system, since it only needs to process a single acoustic channel. It also makes the proposed system independent of the room layout and of the number of microphones.
Acoustic beamforming also has a drawback: all spatial information about the speaker location, which is carried by the multiple microphones in the room, is lost in the process. For this reason, when multiple microphones are available (and beamforming is therefore performed), the speaker location information is reused in the speaker diarization module. This information comes from the Time Delay of Arrival (TDOA) values between each microphone and the reference channel. Although extensive research has gone into speaker localization using multiple microphones (including distinguishing each speaker from the others), this is only possible when the topology and exact location of all microphones are known in advance. This is not assumed in the current implementation of the system, as multiple room topologies are to be processed and, for some of them, the microphone locations are not known.
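To make the notion of a TDOA value between a microphone and the reference channel concrete, the following is a minimal sketch of one common estimator, the generalized cross-correlation with phase transform (GCC-PHAT). The text above does not state which estimator is used, so the choice of GCC-PHAT, the function names `gcc_phat_tdoa` and `tdoa_vector`, and the `max_delay_s` parameter are all assumptions made for illustration.

```python
import numpy as np


def gcc_phat_tdoa(channel, reference, sample_rate, max_delay_s=None):
    """Estimate the delay, in seconds, of `channel` relative to `reference`.

    Illustrative GCC-PHAT sketch (an assumed estimator, not taken from the
    text). A positive result means `channel` lags behind `reference`.
    """
    channel = np.asarray(channel, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Zero-pad so that the circular correlation behaves like a linear one.
    n = len(channel) + len(reference)
    spec_ch = np.fft.rfft(channel, n=n)
    spec_ref = np.fft.rfft(reference, n=n)
    cross = spec_ch * np.conj(spec_ref)
    # PHAT weighting: keep only the phase of the cross-spectrum.
    cross /= np.abs(cross) + 1e-12
    corr = np.fft.irfft(cross, n=n)
    # Restrict the search to physically plausible lags.
    max_shift = n // 2
    if max_delay_s is not None:
        max_shift = max(1, min(int(sample_rate * max_delay_s), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift.
    corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
    lag = int(np.argmax(corr)) - max_shift
    return lag / sample_rate


def tdoa_vector(channels, ref_index, sample_rate, max_delay_s=None):
    """Return one TDOA per non-reference channel for a single analysis window."""
    reference = channels[ref_index]
    return [
        gcc_phat_tdoa(ch, reference, sample_rate, max_delay_s)
        for i, ch in enumerate(channels)
        if i != ref_index
    ]
```

Under these assumptions, calling `tdoa_vector` on each analysis window would yield one delay per non-reference microphone, and such a vector needs no knowledge of the microphone positions, only the choice of a reference channel.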
Apart from the TDOA values, other features could be useful to discriminate between speakers. One possibility is the relative amplitude between the different channels, which should indicate which speaker is closest to which microphone and is therefore an indicator of speaker location. This metric is, however, highly correlated with the TDOA values and suffers from the same problems, so it has not been considered in this thesis. Further study is needed to determine whether using both information streams could further improve speaker diarization.
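As an illustration of what a relative amplitude feature could look like, the sketch below computes, for one analysis window, the energy of each channel relative to the reference channel in decibels. The thesis does not define this feature, so the log-energy-ratio formulation, the function name `log_energy_ratios`, and the `eps` floor are assumptions.

```python
import numpy as np


def log_energy_ratios(frames, ref_index=0, eps=1e-12):
    """Relative channel energy, in dB, with respect to the reference channel.

    `frames` has shape (n_channels, n_samples) and holds one analysis window
    per channel. Hypothetical feature definition, for illustration only.
    """
    frames = np.asarray(frames, dtype=float)
    # Mean energy per channel; eps avoids log(0) on silent windows.
    energy = np.mean(frames ** 2, axis=1) + eps
    ratios_db = 10.0 * np.log10(energy / energy[ref_index])
    # Drop the reference channel itself, whose ratio is always 0 dB.
    return np.delete(ratios_db, ref_index)
```

A channel whose amplitude is half that of the reference would give a value of about -6 dB, which would suggest that the active speaker is farther from that microphone than from the reference.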