Because the acoustic modeling imposes a minimum segment duration, speech segments that legitimately belong to a particular cluster can be ``infected'' with runs of non-speech frames and frames belonging to other sources. Such runs are too short for the segment-based decoding to treat them as independent clusters, and eliminating them with the model-based speech/non-speech detector would substantially increase the missed-speech error. They cause the models to drift away from their acoustic modeling targets, which is particularly harmful when deciding whether to merge two clusters. The frame-level purification presented here focuses on detecting and eliminating the non-speech frames that do not help to discriminate between speakers (e.g., short pauses, occlusive silences, low-information fricatives).
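As a minimal sketch of the idea (not the exact criterion used in this work), a purification step can score each frame within a cluster and discard frames that fall below a threshold. Here the score is short-term frame energy and the threshold is a hypothetical fraction of the cluster's mean energy; both the criterion and the `energy_floor_ratio` parameter are illustrative assumptions.

```python
def purify_cluster_frames(frames, energy_floor_ratio=0.1):
    """Drop low-energy frames (e.g. short pauses, occlusive silences)
    from a cluster's frame set.

    Hypothetical criterion: keep a frame only if its short-term energy
    exceeds `energy_floor_ratio` times the cluster's mean frame energy.
    A real system would typically score frames with the cluster's own
    acoustic model instead of raw energy.
    """
    # Mean squared amplitude per frame as a simple energy measure.
    energies = [sum(s * s for s in frame) / len(frame) for frame in frames]
    mean_energy = sum(energies) / len(energies)
    threshold = energy_floor_ratio * mean_energy
    # Retain only frames whose energy clears the (illustrative) floor.
    return [f for f, e in zip(frames, energies) if e >= threshold]
```

Removing such frames before re-estimating a cluster's model keeps the model closer to its target speaker, which in turn makes cluster-merge decisions more reliable.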