Diarization Error Rate

The main metric used for speaker diarization experiments is the Diarization Error Rate (DER), as described and used by NIST in the RT evaluations (NIST Fall Rich Transcription on meetings 2006 Evaluation Plan, 2006). It is measured as the fraction of time that is not attributed correctly to a speaker or to non-speech. To measure it, a script named MD-eval-v21.pl (NIST MD-eval-v21 DER evaluation script, 2006), developed by NIST, was used.

As per the definition of the task, the system hypothesis diarization output does not need to identify the speakers by name or definite ID. Therefore, the ID tags assigned to the speakers in the hypothesis and in the reference segmentation do not need to be the same. This is unlike the non-speech tags, which are marked as unlabelled gaps between two speaker segments and therefore do implicitly need to be identified.

The evaluation script first computes an optimal one-to-one mapping of all speaker label IDs between the hypothesis and reference files. This allows scoring even though the ID tags differ between the two files. The Diarization Error Rate score is computed as
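The one-to-one mapping step can be illustrated with a minimal sketch. It assumes that the mapping is chosen to maximize the total time shared between matched reference and hypothesis speakers, and it searches all permutations by brute force, which is only practical for a handful of speakers. The function names and the data layout (`best_mapping`, dictionaries of `(start, end)` turns) are hypothetical and are not taken from the NIST script.

```python
from itertools import permutations


def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def best_mapping(ref, hyp):
    """Brute-force one-to-one speaker mapping (illustrative only).

    ref, hyp: dicts mapping a speaker label to a list of (start, end) turns.
    Returns ({hyp_label: ref_label}, total_shared_time).
    """
    ref_ids, hyp_ids = list(ref), list(hyp)
    # Pad the shorter side with None so that extra speakers stay unmatched.
    n = max(len(ref_ids), len(hyp_ids))
    ref_pad = ref_ids + [None] * (n - len(ref_ids))
    hyp_pad = hyp_ids + [None] * (n - len(hyp_ids))

    def shared(r, h):
        # Time during which reference speaker r and hypothesis speaker h
        # are both active; zero when either side is a padding entry.
        if r is None or h is None:
            return 0.0
        return sum(overlap(a, b) for a in ref[r] for b in hyp[h])

    best, best_time = {}, -1.0
    for perm in permutations(ref_pad):
        total = sum(shared(r, h) for r, h in zip(perm, hyp_pad))
        if total > best_time:
            best_time = total
            best = {h: r for r, h in zip(perm, hyp_pad)
                    if r is not None and h is not None}
    return best, best_time


# Example usage with made-up turns (times in seconds).
ref = {"A": [(0.0, 10.0)], "B": [(10.0, 20.0)]}
hyp = {"spk1": [(0.0, 9.0)], "spk2": [(9.0, 20.0)]}
mapping, shared_time = best_mapping(ref, hyp)
```

For larger numbers of speakers an assignment algorithm such as the Hungarian method would be the usual replacement for the exhaustive search.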

$\displaystyle DER = \frac{\sum_{s=1}^{S} \mbox{dur}(s) \cdot (\mbox{max}(N_{ref}(s), N_{hyp}(s)) - N_{correct}(s))}{\sum_{s=1}^{S} \mbox{dur}(s) \cdot N_{ref}(s)}$ (6.1)

where $S$ is the total number of speaker segments over which the reference and hypothesis files each contain the same speaker or speakers throughout. These segments are obtained by collapsing together the hypothesis and reference speaker turns. The terms $N_{ref}(s)$ and $N_{hyp}(s)$ indicate the number of speakers speaking in segment $s$ in the reference and in the hypothesis respectively, and $N_{correct}(s)$ indicates the number of speakers that speak in segment $s$ and have been correctly matched between reference and hypothesis. Segments labelled as non-speech are considered to contain 0 speakers. When all speakers/non-speech in a segment are correctly matched, the error for that segment is 0.
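As an illustration of eq. 6.1, the following minimal sketch computes the DER from a list of collapsed segments. It assumes that the per-segment counts are already available, i.e. that the segmentation and the speaker mapping have been done beforehand. The tuple layout and the segment values are invented for the example.

```python
def der(segments):
    """DER of eq. 6.1, returned as a fraction.

    segments: iterable of (dur, n_ref, n_hyp, n_correct) tuples, one per
    collapsed segment s, with dur in seconds.
    """
    segments = list(segments)
    # Numerator: time weighted by the number of wrongly handled speakers.
    error_time = sum(d * (max(nr, nh) - nc) for d, nr, nh, nc in segments)
    # Denominator: scored time, counting overlapped speech once per speaker.
    scored_time = sum(d * nr for d, nr, _, _ in segments)
    return error_time / scored_time


# Made-up example segments: (dur, N_ref, N_hyp, N_correct)
example = [
    (10.0, 1, 1, 1),  # one speaker, correctly matched
    (2.0, 1, 0, 0),   # reference speech hypothesized as non-speech
    (1.0, 0, 1, 0),   # reference non-speech hypothesized as speech
    (3.0, 1, 1, 0),   # one speaker, mapped to the wrong speaker
    (4.0, 2, 1, 1),   # two overlapping speakers, only one found
]
print(f"DER = {der(example):.4f}")
```

In this example the error time is 10 s over a scored time of 23 s, so the non-speech segment contributes to the numerator but not to the denominator.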

The DER can be decomposed into errors coming from different sources, which are:

Speaker error: percentage of scored time that a speaker ID is assigned to the wrong speaker. This type of error does not account for undetected speakers in overlap or for any error coming from non-speech frames. It can be written as

$\displaystyle E_{spkr} = \frac{\sum_{s=1}^{S} \mbox{dur}(s) \cdot (\mbox{min}(N_{ref}(s), N_{hyp}(s)) - N_{correct}(s))}{T_{score}}$ (6.2)

where $T_{score} = \sum_{s=1}^{S} \mbox{dur}(s) \cdot N_{ref}(s)$ is the total scoring time, i.e. the denominator of eq. 6.1.

False alarm speech: percentage of scored time that a hypothesized speaker is labelled as non-speech in the reference. It can be formulated as

$\displaystyle E_{FA} = \frac{\sum_{s=1}^{S} \mbox{dur}(s) \cdot (N_{hyp}(s) -N_{ref}(s))}{T_{score}} \ \ \ \ \forall \ (N_{hyp}(s) -N_{ref}(s)) > 0$ (6.3)

computed only over segments where the reference segment is labelled as non-speech.

Missed speech: percentage of scored time that a hypothesized non-speech segment corresponds to a reference speaker segment. It can be expressed as

$\displaystyle E_{MISS} = \frac{\sum_{s=1}^{S} \mbox{dur}(s) \cdot (N_{ref}(s) -N_{hyp}(s))}{T_{score}} \ \ \ \ \forall \ (N_{ref}(s) -N_{hyp}(s)) > 0$ (6.4)

computed only over segments where the hypothesis segment is labelled as non-speech.

Overlap speaker: percentage of scored time that some of the multiple speakers in a segment do not get assigned to any speaker. This error is usually folded into either $E_{MISS}$ or $E_{FA}$, depending on whether it is the reference or the hypothesis that contains the unassigned speakers. If multiple speakers appear in both the reference and the hypothesis, the resulting error belongs to $E_{spkr}$.

Given all possible errors one can rewrite equation 6.1 as

$\displaystyle DER = E_{spkr} + E_{MISS} + E_{FA} + E_{ovl}$ (6.5)
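The decomposition can be checked numerically with a small sketch. It follows one possible reading of the definitions above: false alarm is counted only where the reference is non-speech, missed speech only where the hypothesis is non-speech, and any remaining mismatch in the number of speakers (both sides containing speech) is reported separately as the overlap term. This split of the overlap term is an assumption made for illustration, and the function name and segment values are hypothetical.

```python
def der_components(segments):
    """Split the DER into its components, each as a fraction of T_score.

    segments: iterable of (dur, n_ref, n_hyp, n_correct) tuples.
    Returns a dict with keys 'spkr', 'miss', 'fa' and 'ovl'.
    """
    segments = list(segments)
    t_score = sum(d * nr for d, nr, _, _ in segments)
    spkr = miss = fa = ovl = 0.0
    for d, nr, nh, nc in segments:
        # Eq. 6.2: speakers present on both sides but wrongly matched.
        spkr += d * (min(nr, nh) - nc)
        if nr == 0:
            # Eq. 6.3: reference is non-speech, hypothesis has speakers.
            fa += d * nh
        elif nh == 0:
            # Eq. 6.4: hypothesis is non-speech, reference has speakers.
            miss += d * nr
        else:
            # Assumed overlap term: speech on both sides, counts differ.
            ovl += d * abs(nr - nh)
    return {"spkr": spkr / t_score, "miss": miss / t_score,
            "fa": fa / t_score, "ovl": ovl / t_score}


# Made-up example segments: (dur, N_ref, N_hyp, N_correct)
example = [
    (10.0, 1, 1, 1),
    (2.0, 1, 0, 0),
    (1.0, 0, 1, 0),
    (3.0, 1, 1, 0),
    (4.0, 2, 1, 1),
]
parts = der_components(example)
total = sum(parts.values())  # agrees with eq. 6.1 for the same segments
```

The four terms add up to the DER of eq. 6.1 because $\mbox{max}(N_{ref}(s), N_{hyp}(s)) = \mbox{min}(N_{ref}(s), N_{hyp}(s)) + \vert N_{ref}(s) - N_{hyp}(s) \vert$ in every segment.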

When evaluating performance, a collar can be defined around every reference speaker turn to account for inaccuracies in the labelling of the data. It was estimated by NIST that a $\pm$250 ms collar could account for all these differences. When people overlap each other in the recording, this is stated in the reference file, with as many as 5 speaker turns being assigned to the same time instant. As the denominator of eq. 6.1 shows, the total evaluated time includes the overlaps. Errors produced when the system does not detect any or some of the multiple speakers in overlap count as missed speaker errors.
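The effect of the collar can be sketched as follows, assuming that it is applied as a no-score zone of $\pm$250 ms around each reference turn boundary and that time starts at 0. The helper names are hypothetical and the sketch is not meant to reproduce the exact behaviour of the NIST script.

```python
def collar_zones(ref_turns, collar=0.25):
    """Merged (start, end) intervals excluded from scoring.

    ref_turns: list of (start, end) reference speaker turns in seconds.
    collar: half-width in seconds of the zone around each boundary.
    """
    zones = sorted((max(0.0, t - collar), t + collar)
                   for start, end in ref_turns for t in (start, end))
    merged = []
    for z_start, z_end in zones:
        if merged and z_start <= merged[-1][1]:
            # Overlapping or touching zones are fused into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], z_end))
        else:
            merged.append((z_start, z_end))
    return merged


def scored_duration(segment, zones):
    """Duration of a (start, end) segment left after removing the zones."""
    start, end = segment
    excluded = sum(max(0.0, min(end, z_end) - max(start, z_start))
                   for z_start, z_end in zones)
    return (end - start) - excluded


# Example: two consecutive reference turns sharing a boundary at 5.0 s.
zones = collar_zones([(1.0, 5.0), (5.0, 8.0)])
remaining = scored_duration((1.0, 5.0), zones)
```

With these values the first turn loses 0.25 s at each end, so 3.5 s of its 4 s remain in the scored time.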

Once the performance has been obtained for each individual meeting excerpt, a time-weighted average is computed over all meetings in a given set to obtain an overall score. The scored time is used as the weight, as it indicates the total time (overlapped speakers included) that has been evaluated in each excerpt.
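The time-weighted average can be written in a few lines. The sketch below assumes that each excerpt is summarized by its DER (as a fraction) and its scored time in seconds. The function name and the numbers are made up for the example.

```python
def overall_der(excerpts):
    """Time-weighted average DER over a set of meeting excerpts.

    excerpts: iterable of (der, scored_time) pairs, one per excerpt.
    """
    excerpts = list(excerpts)
    total_time = sum(t for _, t in excerpts)
    # Each excerpt contributes in proportion to its scored time.
    return sum(d * t for d, t in excerpts) / total_time


# Two hypothetical excerpts: 10% DER over 600 s and 30% DER over 200 s.
average = overall_der([(0.10, 600.0), (0.30, 200.0)])
```

Here the longer excerpt dominates, giving an overall score of 15% rather than the 20% that an unweighted mean would produce.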
