Restoring Speaking Lips from Occlusion for Audio-Visual Speech Recognition

AAAI 2024 (2024)

Abstract
Prior studies on audio-visual speech recognition typically assume that the speaker's lips are visible, ignoring the visual occlusion that occurs in real-world videos and degrades recognition performance. To address this issue, we propose a framework that restores occluded lips in a video by utilizing both the video itself and the corresponding noisy audio. Specifically, the framework performs three tasks: detecting occluded frames, masking occluded areas, and reconstructing masked regions. We tackle the first two tasks by using the Class Activation Map (CAM) obtained from occluded-frame detection to guide the masking of occluded areas. Additionally, we introduce a novel synthesis-matching strategy for the reconstruction to ensure the compatibility of audio features with different levels of occlusion. We evaluate the framework in terms of Word Error Rate (WER) with several existing state-of-the-art audio-visual speech recognition methods on three sets of videos: the originals, versions corrupted by concealed lips, and versions restored by the framework. Experimental results show that our framework significantly mitigates the performance degradation caused by lip occlusion. Under -5 dB noise conditions, AV-Hubert's WER increases from 10.62% to 13.87% due to lip occlusion, but recovers to 11.87% when combined with the proposed framework. Furthermore, qualitative assessments show that the framework produces natural-looking synthesized images.
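The abstract reports all results as Word Error Rate, so a minimal sketch of the standard WER computation may help in reading the numbers. It is not taken from the paper's code: the function names `word_error_rate` and `recovered_fraction` are hypothetical, and the "recovered fraction" is an illustrative derived quantity, not a metric the authors report.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def recovered_fraction(clean: float, occluded: float, restored: float) -> float:
    """Share of the occlusion-induced WER increase removed by restoration.

    Hypothetical helper for illustration only; not a metric from the paper.
    """
    return (occluded - restored) / (occluded - clean)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # WER values quoted in the abstract for the -5 dB condition.
    print(recovered_fraction(clean=10.62, occluded=13.87, restored=11.87))
```

Applied to the figures quoted above, occlusion adds 3.25 WER points (10.62% to 13.87%) and restoration removes 2.00 of them (13.87% to 11.87%), which is roughly 62% of the degradation.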
Keywords
NLP: Speech, CV: Multi-modal Vision