Revisiting audio visual scene-aware dialog

Neurocomputing (2022)

Abstract
Audio Visual Scene-Aware Dialog (AVSD) has drawn intense interest: models are required to understand dynamic scenes in videos and dialog contexts in order to converse with human users by generating responses to given questions. Existing works have laid a solid foundation towards solving the AVSD problem. In contrast to previous studies, this paper empirically revisits the AVSD task and argues that it exhibits a variety of biases in terms of models, dataset, and evaluation metrics: (1) as for the models, we believe that the state-of-the-art frameworks do not utilize multimodal features to their full extent; (2) as for the dataset, we conduct a deep analysis of dataset statistics across different types of questions and find that the dataset is slightly biased in several specific aspects; by simply implementing a caption-only baseline that has never seen the video, we achieve state-of-the-art performance on the AVSD task; (3) as for the evaluation metrics, we argue that the current metrics for AVSD primarily focus on the naturalness of generated responses while ignoring their truthfulness, which makes them fall short of disclosing the consistency between model predictions and the actual visual content. Overall, our analysis aims to provide a detailed inspection of the AVSD task, and we hope that our empirical observations can inspire further improvement to the task.
Keywords
Multimodal dialog systems, Modality bias, Multimodal evaluation