In this chapter, the main techniques used in recent years for speaker diarization (i.e. speaker segmentation and clustering) and for acoustic beamforming are reviewed. First, the features that have been found suitable for speaker diarization are described. Then, the algorithms and systems that address the task in general are introduced. Finally, some ground is set on techniques oriented towards performing speaker diarization in meetings, this being the main application domain of this thesis.
Speaker diarization can be defined as a subtype of audio diarization in which the speech segments of the signal are split among the different speakers (Reynolds and Torres-Carrasquillo, 2004). It generally answers the question ``Who spoke when?'' and is sometimes referred to as speaker segmentation and clustering. In the application domain of this thesis it is performed without any prior knowledge of the identity of the speakers in the recordings or of how many speakers there are. This, however, is not a requirement of speaker diarization, as partial knowledge of the identities of some people in the recordings, of the number of speakers or of the structure of the audio (what follows what) might be available and used depending on the application at hand. None of this information is provided in the RT evaluation campaigns organized by NIST (NIST Spring Rich Transcription Evaluation in Meetings website, http://www.nist.gov/speech/tests/rt/rt2005/spring, 2006), which define the task used to evaluate all the algorithms presented in this thesis.
According to Reynolds and Torres-Carrasquillo (2004), there are three main domains of application for speaker diarization that have received special attention over the years:
Furthermore, one could consider other particular domains, such as air traffic communications, in-car dialog, and others.
As parts of speaker diarization, speaker segmentation and speaker clustering belong to the pattern classification family, where one tries to find categorical (discrete) classes for continuous observations of speech and, by doing so, finds the boundaries between them. Speech recognition is also a pattern classification problem. As such, all of these tasks need a feature set that represents the acoustic data well, together with a distance measure or method to assign each feature vector to a class.
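To make this assignment step concrete, the following minimal sketch (illustrative only, and not a system described in this thesis) maps each feature vector to the class with the nearest mean under a Euclidean distance; actual diarization systems use richer class models, such as Gaussian mixtures, and likelihood-based measures.

\begin{verbatim}
import numpy as np

def assign_to_classes(features, class_means):
    # features: (T, D) array of frame-level feature vectors.
    # class_means: (K, D) array, one mean vector per class.
    # Returns a length-T array of class indices.
    dists = np.linalg.norm(features[:, None, :] - class_means[None, :, :],
                           axis=2)          # (T, K) distance matrix
    return np.argmin(dists, axis=1)

# Toy usage: 100 ten-dimensional frames assigned to 3 classes.
rng = np.random.default_rng(0)
labels = assign_to_classes(rng.normal(size=(100, 10)),
                           rng.normal(size=(3, 10)))
\end{verbatim}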
In general, clustering data into classes is a well-studied technique for statistical data analysis, with applications in many fields, including machine learning, data mining, pattern recognition, image analysis, bioinformatics and others.
When using clustering techniques for speaker or acoustic clustering, one first needs to define the segments that are going to be clustered, which might be of different sizes and characteristics (speech, non-speech, music, noises). When creating the segments with segmentation techniques, one needs to separate the speech stream into speakers, not into words or phones. Any speech segment contains voiced and unvoiced phones, as well as short pauses between phones or prosodic stops. A speaker segmentation and clustering algorithm therefore needs to define which properties of the data characterize a single speaker and to provide techniques for assigning such data to a single cluster. To do so, one needs appropriate acoustic models, together with their parameters and training algorithms, so that differences in the acoustics are correctly identified at the speaker level.
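As one concrete example of such an assignment decision, many systems in the literature compare two clusters with the Bayesian Information Criterion (BIC), merging them when a single Gaussian explains their pooled data better than two separate Gaussians. The following is a minimal sketch of this rule (assuming full-covariance Gaussians and enough frames per cluster for well-conditioned covariance estimates), not necessarily the criterion adopted later in this thesis.

\begin{verbatim}
import numpy as np

def delta_bic(x, y, lam=1.0):
    # x, y: (Nx, D) and (Ny, D) feature matrices of two clusters;
    # both Nx and Ny should exceed D for usable covariance estimates.
    # A value <= 0 favours merging the two clusters into one.
    z = np.vstack([x, y])
    n, d = z.shape
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(z)
            - 0.5 * len(x) * logdet(x)
            - 0.5 * len(y) * logdet(y)
            - lam * penalty)
\end{verbatim}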
The first section of this chapter takes a look at the features that have proven useful for speaker-based processing such as speaker diarization. Emphasis is given to alternatives to the traditional features, focusing on speaker characteristics that better discriminate between and help identify the speakers present in a recording.
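As a point of reference for that discussion, the traditional front-end is a short-term cepstral parameterization such as MFCCs. The sketch below extracts them with the librosa library; the file name and parameter values are illustrative only.

\begin{verbatim}
import librosa

# Hypothetical recording; 16 kHz is a common rate for meeting audio.
signal, rate = librosa.load("meeting.wav", sr=16000)

# 19 coefficients over 25 ms windows with a 10 ms step is a
# typical diarization front-end configuration.
mfcc = librosa.feature.mfcc(y=signal, sr=rate, n_mfcc=19,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (19, number_of_frames)
\end{verbatim}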
Following the features review, an overview of the main techniques used in the areas of speaker segmentation and speaker diarization is given. Speaker segmentation is a first step in many speaker diarization systems, and it is therefore useful to review the techniques that have mainly been used in the past and to lay the theoretical groundwork for the speaker diarization review. After the main speaker diarization systems have been described, the focus turns to speaker diarization for meetings, which is the implementation focus of this thesis.
In meetings, one usually has several microphones available for processing, all located at various positions around the speakers inside the meeting room. Although most of these microphone sets are not designed to form a microphone array in theory (the AMI microphone set is), in practice it is useful to apply microphone array beamforming techniques to combine the microphone signals into one ``enhanced'' channel and then process only this channel with the diarization system. This has the advantage that the speaker diarization system remains completely transparent to the particularities of each meeting room setup and always processes a single channel, gaining speed over solutions that involve processing all channels in parallel.
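The simplest instance of such a combination is delay-and-sum beamforming: each channel is shifted by its estimated delay relative to a reference channel and the aligned signals are averaged into a single ``enhanced'' channel. The following is a minimal sketch of this idea under the assumption that integer-sample delays have already been estimated, not the exact implementation used in this thesis.

\begin{verbatim}
import numpy as np

def delay_and_sum(channels, delays):
    # channels: list of equal-length 1-D signals, one per microphone.
    # delays: per-channel delay in samples relative to the reference.
    out = np.zeros_like(channels[0], dtype=float)
    for sig, d in zip(channels, delays):
        out += np.roll(sig, -d)   # crude integer-sample alignment
    return out / len(channels)
\end{verbatim}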
In the last section of this state-of-the-art review, the main techniques currently available for acoustic beamforming are covered; these have been applied in the implemented system in order to take advantage of the multiple available microphones. First, an overview of the techniques used to obtain an ``enhanced'' output signal from multiple input signals is given. Then, possible ways to estimate the delay between the channels are explored; such delays are necessary in order to align the acoustic data, as required by the majority of beamforming algorithms.
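One widely used delay estimator in this context is the generalized cross-correlation with phase transform (GCC-PHAT), which whitens the cross-power spectrum of a channel pair so that the correlation peak depends on phase (i.e. timing) rather than on signal level. The sketch below is a minimal illustration of the idea rather than a description of the implemented system.

\begin{verbatim}
import numpy as np

def gcc_phat(ref, sig, max_delay):
    # Returns the delay (in samples) of `sig` with respect to `ref`,
    # searched within +/- max_delay samples (max_delay << len(ref)).
    n = len(ref) + len(sig)
    cross = np.fft.rfft(ref, n) * np.conj(np.fft.rfft(sig, n))
    cross /= np.abs(cross) + 1e-12        # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n)
    # Reorder so index 0 corresponds to lag -max_delay.
    cc = np.concatenate((cc[-max_delay:], cc[:max_delay + 1]))
    return int(np.argmax(cc)) - max_delay
\end{verbatim}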