PhD Thesis
ROBUST SPEAKER DIARIZATION FOR MEETINGS
Author:
Xavier Anguera Miró
Advisors:
Dr. Francisco Javier Hernando Pericás (UPC)
Dr. Chuck Wooters (ICSI)
Speech Processing Group
Department of Signal Theory and Communications
Universitat Politècnica de Catalunya
Barcelona, October 2006
Als meus pares,
(To my parents)
Resum (Abstract)
Contents
List of Tables
List of Figures
Introduction
Context and Motivations of this Thesis
Definition of the Thesis Objectives
Outline of the Thesis
State of the art
Acoustic Features for Speaker Diarization
Speaker Segmentation
Metric-Based Segmentation
Non Metric-Based Segmentation
Silence and Decoder-Based Segmentation
Model-Based Segmentation
Segmentation Using Other Techniques
Speaker Diarization
Hierarchical Clustering Techniques
Bottom-up Clustering Techniques
Top-down Clustering Techniques
Combination of Clustering Methods
Other Clustering Techniques
Use of Support Information in Diarization
Helping Diarization Using the Spoken Transcripts
Speaker Diarization Using Multi-Channel Information
Speaker Diarization in Meetings
Current Meeting Room Research Projects
Databases
NIST RT Speaker Diarization Systems for Meetings
NIST 2002 Speaker Recognition Evaluation
NIST 2004 Rich Transcription Spring Meeting Evaluation
NIST 2005 Rich Transcription Spring Meeting Evaluation
NIST 2006 Rich Transcription Spring Meeting Evaluation
Multichannel Acoustic Enhancement
Introduction to Acoustic Array Processing
Acoustic Signal Propagation
Passive Apertures
Linear Apertures Theory
Microphone Array Beamforming
Time Delay of Arrival Estimation
Speaker Diarization: from Broadcast News to Meetings
The ICSI Broadcast News System
Speech/Non-Speech Detection and Parameters Extraction
Clusters Initialization and Acoustic Modeling
Clusters Comparison, Pruning and Clusters Merging
Stopping Criterion and System Output
Analysis of Differences from Broadcast News to Meetings
Input Data Analysis: Broadcast News versus Meetings
Signal to Noise Ratio
Average Total Speaking Time
Average Number of Speakers
Average Speaker Turn Duration
Meetings Domain Overlap Regions
Summary of Differences and Proposed Changes
Robust Speaker Diarization System for Meetings
Acoustic Signal Enhancement
Single Channel System Frontend
Speaker Clusters and Models Initialization
Models Training Using CV-EM and Clusters Segmentation
Clusters Merging and System Output
Acoustic Modeling Algorithms for Speaker Diarization in Meetings
Speech/Non-Speech Algorithm
Energy-Based Speech/Non-Speech Detector with Variable Threshold
Data Preprocessing
Derivative Filtering
Time Constraints on Speech/Non-Speech
Model-Based Speech/Non-Speech Decoder
Hybrid Speech/Non-Speech Detection
Speaker Clusters Description and Modeling
Friends-and-Enemies Initialization
Initialization Algorithm Description
Clusters and Models Complexity Selection
Model Complexity Selection
Automatic Selection of the Initial Number of Clusters
Acoustic Modeling without Time Restrictions
Cluster Purification Algorithms
Frame-Level Cluster Purification
Speech and Non-Speech Modeling
Frame-Based Cluster Purification Metrics
Frame-Based Purification Implementation
Segment-Level Cluster Purification
Multichannel Processing for Meetings
Multichannel Acoustic Beamforming for Meetings
Meeting Room Microphone Array Characteristics
Filter-and-Sum Beamforming
Multichannel Acoustic Beamforming System Implementation
Individual Channels Signal Enhancement
Meeting Information Extraction
Reference Channel Computation
Overall Channels Weighting Factor
ICSI Meetings Skew Estimation
GCC-PHAT Cross-Correlation
TDOA Values Selection
TDOA Post-Processing
Dual-Pass Viterbi Post-Processing
Output Signal Generation
Automatic Channel Weight Adaptation
Automatic Adaptive Channel Elimination
Channels Sum and Output
Use of the Estimated Delays for Speaker Diarization
TDOA Modeling and Features Fusion
Automatic Features Weight Estimation
Experiments
Meetings Domain Experiments Setup
Baseline Systems
Databases
Evaluation Metrics
Diarization Error Rate
Signal-to-Noise Ratio
Reference Segmentation Selection and Calculation
Experiments from Broadcast News to Meetings
Speech/Non-Speech Detection Block
Acoustic Beamforming Experiments
Baseline System Analysis
Reference Channel Estimation Analysis
TDOA Post-Processing Analysis
Signal Output Algorithms Analysis
Use of the Beamformed Signal for ASR
Speaker Diarization Module Experiments
Individual Algorithms Performance
Number of Initial Clusters and Cluster Complexity Selection
Cross-Validation EM Training
Friends-and-Enemies Clusters Initialization
Frame and Segment Purification
Multiple Feature Streams Automatic Weighting
Algorithms Agglomeration Performance
Overall Experiments and Analysis of Results
NIST Evaluations in Speaker Diarization
NIST Rich Transcription Evaluations in Speaker Diarization for Meetings
RT05s and RT06s Evaluation Conditions
Methodology of the Evaluations
Data Used in the Speaker Diarization Evaluations
ICSI Participation in the RT Evaluations
Participation in the 2005 Spring Rich Transcription Evaluation
Conference Room Systems
Lecture Room Systems
RT05s Official Performance Scores
Participation in the 2006 Spring Rich Transcription Evaluation
RT06s Official Performance Scores
Pros and Cons of the NIST Evaluations
Conclusions
Overall Thesis Final Review
Review of Objectives Completion
Possible Future Work Topics
BIC Formulation for Gaussian Mixture Models
Rich Transcription Evaluation Datasets
Bibliography