Acoustic Modeling without Time Restrictions

This section proposes a small change to the cluster models that removes the dependency of the acoustic models on the average speaker turn length. The acoustic modeling topology is modified by changing the self-loop and transition probabilities in the last state. A minimum duration for a speaker turn can still be enforced as before, but the topology no longer influences the final duration of the turn. A minimum speaker turn duration is advantageous for processing the recordings and can be set independently of the kind of recording encountered, whereas the average speaker turn duration varies considerably between individual recordings and domains. It is therefore better to let the acoustic data alone decide when a speaker turn ends once it has reached the minimum length.

In the cluster models each state contains a set of $ MD$ sub-states, as seen in figure 4.7, imposing a minimum duration on each model. Each sub-state has a probability density function modeled via a Gaussian mixture model (GMM), and the same GMM is tied to all sub-states in any given state. Upon entering a state at time $ n$, the model forces a jump to the following sub-state with probability $ 1.0$ until the last sub-state is reached. In that last sub-state, the model can either remain there with transition weight $ \alpha$ or jump to the first sub-state of another state with weight $ \beta/M$, where $ M$ is the number of active states/clusters at that time. In the baseline system these were set to $ \alpha=0.9$ and $ \beta=0.1$ (summing to 1).
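The topology just described can be written down as a matrix of transition weights. The following is a minimal sketch, not the system's actual implementation: the function name `build_transition_weights`, the flat sub-state indexing and the example values are hypothetical. It follows the text literally in assuming that the last sub-state jumps only to the first sub-state of *other* clusters, each with weight $ \beta/M$.

```python
def build_transition_weights(num_clusters, min_dur, alpha, beta):
    """Return a square matrix of transition weights over all sub-states.

    Sub-state j of cluster k is stored at flat index k * min_dur + j
    (an indexing convention chosen only for this illustration).
    """
    size = num_clusters * min_dur
    weights = [[0.0] * size for _ in range(size)]
    for k in range(num_clusters):
        first = k * min_dur
        last = first + min_dur - 1
        # Forced left-to-right walk through the sub-states (weight 1.0).
        for j in range(first, last):
            weights[j][j + 1] = 1.0
        # Last sub-state: self-loop with weight alpha ...
        weights[last][last] = alpha
        # ... or a jump to the first sub-state of another cluster
        # (assumption: the current cluster is excluded as a target).
        for other in range(num_clusters):
            if other != k:
                weights[last][other * min_dur] = beta / num_clusters
    return weights


# Hypothetical toy setup: 3 clusters, 4 sub-states each, baseline weights.
w = build_transition_weights(num_clusters=3, min_dur=4, alpha=0.9, beta=0.1)
```

Only the row of each last sub-state contains $ \alpha$ and $ \beta$, which is why changing these two weights leaves the minimum duration mechanism untouched.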

Figure 4.7: Cluster models with Minimum duration and modified probabilities

One disadvantage of these settings is that they create an implicit duration model on the data beyond the minimum duration $ MD$, which is set as a parameter. Consider a sequence of $ N$ feature vectors $ X=\{x[1] \dots x[N]\}$ and a set of $ K$ cluster models $ \Theta=\{\Theta_{1} \dots \Theta_{K}\}$. The system assigns equal probability to every cluster when it leaves the previous cluster, and enforces a minimum duration $ MD$ inside each cluster.

In order to study the interaction between the $ \alpha$, $ \beta$ and $ MD$ parameters, the likelihood of the data given the models is analyzed. Equation 4.11 gives the likelihood when the system selects model 1 as the initial model and stays in it for all $ N$ acoustic frames, i.e. with no model changes:

$\displaystyle \mathcal{L}_{0}(X\vert\Theta) = \mathcal{L}(x[1]\vert\Theta_{1}) \prod_{i=2}^{MD}(1 \cdot \mathcal{L}(x[i]\vert\Theta_{1})) \cdot \prod_{i=MD+1}^{N}(\alpha \cdot \mathcal{L}(x[i]\vert\Theta_{1}))$ (4.11)

In equation 4.12 the likelihood is computed for the case when one cluster change occurs within the decoded N frames. The decoding used imposes that the second model will contain at least $ MD$ acoustic frames. Considering models 1 and 2 it can be written as:

$\displaystyle \mathcal{L}_{1}(X\vert\Theta) = \mathcal{L}(x[1]\vert\Theta_{1}) \prod_{i=2}^{MD}(1 \cdot \mathcal{L}(x[i]\vert\Theta_{1})) \cdot \prod_{i=MD+1}^{N_{1}}(\alpha \cdot \mathcal{L}(x[i]\vert\Theta_{1})) \cdot \frac{\beta}{K} \prod_{i=N_{1}}^{N_{1}+MD}(1 \cdot \mathcal{L}(x[i]\vert\Theta_{2})) \cdot \prod_{i=N_{1}+MD+1}^{N}(\alpha \cdot \mathcal{L}(x[i]\vert\Theta_{2}))$ (4.12)

where $ N_{1}$ denotes an arbitrary point in the $ N$ frames at which the change occurs, subject to $ N_{1}>MD$ and $ N_{1}<N-MD$.

The transition probabilities in these equations are the terms that are not affected by the acoustic models. Extending the number of changes to $ C$, the transition probability can be shown to take the form:

$\displaystyle Tr(C)=\left(\frac{\beta}{K}\right)^{C} \alpha^{(N-(C+1)MD)}$ (4.13)
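As a numerical illustration of equation 4.13, the sketch below evaluates $ Tr(C)$ in the log domain. The function name and the example values ($ N=3000$ frames, $ K=10$ clusters, and $ MD=300$ frames, i.e. 3 seconds at an assumed 100 frames per second) are hypothetical and are not taken from the system described here.

```python
import math


def log_transition_weight(num_changes, num_frames, min_dur, num_clusters,
                          alpha, beta):
    """Return log Tr(C) from equation 4.13, with durations in frames."""
    if num_frames < (num_changes + 1) * min_dur:
        raise ValueError("not enough frames for that many segments")
    # C cluster changes, each weighted by beta / K.
    change_term = num_changes * math.log(beta / num_clusters)
    # Every frame beyond the minimum duration of a segment costs alpha.
    duration_term = ((num_frames - (num_changes + 1) * min_dur)
                     * math.log(alpha))
    return change_term + duration_term


# Hypothetical example values (see the assumptions stated above).
N, MD, K = 3000, 300, 10
log_tr_0 = log_transition_weight(0, N, MD, K, alpha=0.9, beta=0.1)
log_tr_1 = log_transition_weight(1, N, MD, K, alpha=0.9, beta=0.1)
print(log_tr_0, log_tr_1)
```

Under these assumed values the path with one change receives a larger transition weight than the path with none, because the extra minimum-duration block removes 300 factors of $ \alpha$ at the cost of a single factor $ \beta/K$.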

It is composed of two factors. The left factor contains the $ \beta$ parameter and depends exclusively on the number of cluster changes and the number of possible clusters to go to. The right factor contains the $ \alpha$ parameter and encodes the duration modeling of each of the acoustic models. This duration model depends on the number of speaker changes $ C$ and the minimum duration $ MD$.

On the broadcast news system the parameters were set to $ \alpha=0.9$, $ \beta=0.1$ and $ MD=3$ seconds. This led to a transition probability that depends on $ C$ and $ MD$ and that in many cases produced segments whose average duration was very close to $ MD$. This was because in most cases, when evaluating $ N$ frames of data, $ \mathcal{L}_{i \ne 0}(X\vert\Theta) > \mathcal{L}_{0}(X\vert\Theta)$. In order to avoid cluster changes every $ MD$ seconds, a lower bound on $ \alpha$ must be set by ensuring that $ Tr(C) < Tr(0)$ for every $ C \ne 0$, computed for the hypothetical case in which all models are the same (i.e. $ \Theta_{i} = \Theta_{j}, \forall i,j$). Applying this condition to the transition probabilities for all possible $ C$ values gives:

$\displaystyle \alpha^{MD} > \frac{\beta}{K}$ (4.14)
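For completeness, the step from equation 4.13 to equation 4.14 can be sketched as follows, assuming $ C \ge 1$ and $ \alpha > 0$, with identical acoustic models so that only the transition terms differ between paths:

```latex
\begin{align*}
Tr(C) &< Tr(0) \\
\left(\frac{\beta}{K}\right)^{C} \alpha^{N-(C+1)MD} &< \alpha^{N-MD} \\
\left(\frac{\beta}{K}\right)^{C} &< \alpha^{C \cdot MD} \\
\frac{\beta}{K} &< \alpha^{MD}
\end{align*}
```

The third line divides both sides by $ \alpha^{N-(C+1)MD}$, and the last line takes the $ C$-th root of both (positive) sides, so the resulting bound does not depend on $ C$ or $ N$.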

In order to remove the dependency of the duration modeling on $ MD$ while still satisfying equation 4.14, the parameters were set to $ \alpha=1.0$ and $ \beta=1.0$. Thus, once a segment exceeds the minimum duration, the HMM state transitions no longer influence the speaker turn length; it is governed solely by the acoustics. This creates a non-standard (but valid) HMM topology, as $ \alpha + \beta$ no longer sum to 1.
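The condition in equation 4.14 can be checked numerically for both parameter settings. The sketch below is only illustrative: the helper name `satisfies_lower_bound`, the frame rate of 100 frames per second used to convert $ MD=3$ seconds into frames, and the choice of $ K=10$ clusters are assumptions, not values stated in this section.

```python
import math


def satisfies_lower_bound(alpha, beta, num_clusters, min_dur_frames):
    """Check equation 4.14, alpha**MD > beta / K, in the log domain."""
    return min_dur_frames * math.log(alpha) > math.log(beta / num_clusters)


FRAMES_PER_SECOND = 100               # assumption: 10 ms frame shift
MIN_DUR_FRAMES = 3 * FRAMES_PER_SECOND  # MD = 3 seconds, expressed in frames
NUM_CLUSTERS = 10                     # hypothetical number of clusters K

baseline_ok = satisfies_lower_bound(0.9, 0.1, NUM_CLUSTERS, MIN_DUR_FRAMES)
modified_ok = satisfies_lower_bound(1.0, 1.0, NUM_CLUSTERS, MIN_DUR_FRAMES)
print("baseline (alpha=0.9, beta=0.1):", baseline_ok)
print("modified (alpha=1.0, beta=1.0):", modified_ok)
```

Under these assumptions the baseline setting violates the bound while the modified setting satisfies it. With $ \alpha=1.0$ the left-hand side equals 1 for any $ MD$, so the bound holds whenever $ K>1$.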
