Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition

IEEE/ACM Transactions on Audio, Speech & Language Processing(2015)

Abstract
With the increasing use of multimedia data in communication technologies, the idea of employing visual information in automatic speech recognition (ASR) has recently gathered momentum. In conjunction with the acoustical information, the visual data enhances the recognition performance and improves the robustness of ASR systems in noisy and reverberant environments. In audio-visual systems, dynamic weighting of the audio and video streams according to their instantaneous confidence is essential for reliably and systematically achieving high performance. In this paper, we present a complete framework that allows blind estimation of dynamic stream weights for audio-visual speech recognition based on coupled hidden Markov models (CHMMs). As stream weight estimators, we consider multilayer perceptrons and logistic functions that map multidimensional reliability measure features to audio-visual stream weights. Training the parameters of the stream weight estimator requires numerous input-output tuples of reliability measure features and their corresponding stream weights. We estimate these stream weights based on oracle knowledge using an expectation maximization algorithm. We define 31-dimensional feature vectors that combine model-based and signal-based reliability measures as inputs to the stream weight estimator. During decoding, the trained stream weight estimator is used to blindly estimate stream weights. The entire framework is evaluated using the Grid audio-visual corpus and compared to state-of-the-art stream weight estimation strategies. The proposed framework significantly enhances the performance of the audio-visual ASR system in all examined test conditions.
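To make the idea of a logistic stream weight estimator concrete, the following is a minimal sketch and not the paper's implementation. It assumes the common multi-stream formulation in which a single audio weight lambda in (0, 1) scales the audio log-likelihood and (1 - lambda) scales the video log-likelihood; the abstract does not spell out this combination rule. The function names, the coefficients, and the two-dimensional toy feature vector are all hypothetical (the paper uses 31-dimensional reliability features).

```python
import math


def logistic_stream_weight(features, coeffs, bias):
    """Map a reliability feature vector to an audio stream weight in (0, 1).

    Assumption: a plain logistic function of a linear combination of the
    features. The coefficients here are placeholders; in the paper they
    would be trained on oracle stream weights.
    """
    z = bias + sum(c * f for c, f in zip(coeffs, features))
    # Numerically stable logistic: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def combine_log_likelihoods(log_b_audio, log_b_video, audio_weight):
    """Weighted sum of per-stream log-likelihoods for one frame and state.

    Assumption: the video weight is (1 - audio_weight), a common
    convention for two-stream models.
    """
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_video


# Hypothetical toy values: two reliability features for a single frame.
features = [0.5, -1.0]
coeffs = [2.0, 1.0]
bias = 0.0

lam = logistic_stream_weight(features, coeffs, bias)
frame_score = combine_log_likelihoods(-10.0, -20.0, lam)
```

In a decoder, a weight like `lam` would be recomputed for every frame from that frame's reliability features, so that the audio stream contributes less when it is estimated to be unreliable.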
Keywords
audio streaming,audio-visual systems,decoding,expectation-maximisation algorithm,feature extraction,hidden markov models,learning (artificial intelligence),multilayer perceptrons,speech coding,speech recognition,video streaming,31-dimensional feature vector,asr,chmm,acoustical information,communication technology,coupled hidden markov model,coupled-hmm-based audio-visual automatic speech recognition,expectation maximization algorithm,grid audio-visual corpus,learning dynamic stream weight blind estimation,logistic function,multidimensional reliability,multilayer perceptron,multimedia data,oracle knowledge,signal-based reliability,audio-visual speech recognition,logistic regression,reliability measure,stream weight,reliability,learning artificial intelligence,vectors,speech