Attention-Based Multimodal Deep Learning on Vision-Language Data: Models, Datasets, Tasks, Evaluation Metrics and Applications.

IEEE Access (2023)

Abstract
Multimodal learning has gained immense popularity due to the explosive growth in the volume of image and textual data across various domains. Heterogeneous vision-language multimodal data has been utilized to solve a variety of tasks, including classification, image segmentation, image captioning, and question answering. Consequently, several attention-based deep learning approaches have been proposed for image-text multimodal data. In this paper, we highlight the current status of attention-based deep learning approaches on vision-language multimodal data by presenting a detailed description of the existing models, their performance, and the variety of evaluation metrics used to assess them. We revisit the attention mechanisms applied to image-text multimodal data from their inception in 2015 through 2022, covering a total of 75 articles in the survey. Our discussion also encompasses the current tasks, datasets, application areas, and future directions in this domain. This is the first attempt to discuss the vast scope of attention-based deep learning mechanisms on image-text multimodal data.
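The abstract refers to attention-based fusion of image and text features only in general terms. The snippet below is a minimal, hypothetical sketch of cross-modal attention (text tokens attending over image-region features) in PyTorch, included purely for illustration; it is not taken from the paper or from any specific model it surveys, and all module names and dimensions are assumptions.

```python
# Illustrative sketch (not from the paper): text token features attend over
# image region features via scaled dot-product attention -- the generic
# pattern used in image-text tasks such as captioning and VQA.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    def __init__(self, text_dim, image_dim, hidden_dim):
        super().__init__()
        self.query = nn.Linear(text_dim, hidden_dim)   # queries from text tokens
        self.key = nn.Linear(image_dim, hidden_dim)    # keys from image regions
        self.value = nn.Linear(image_dim, hidden_dim)  # values from image regions

    def forward(self, text_feats, image_feats):
        # text_feats:  (batch, num_tokens, text_dim)
        # image_feats: (batch, num_regions, image_dim)
        q = self.query(text_feats)
        k = self.key(image_feats)
        v = self.value(image_feats)
        # Each text token attends to every image region.
        scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)            # (batch, num_tokens, num_regions)
        return torch.matmul(weights, v)                # image-conditioned text features

# Example usage with random tensors standing in for word embeddings and CNN region features.
text = torch.randn(2, 12, 300)
image = torch.randn(2, 36, 2048)
attn = CrossModalAttention(text_dim=300, image_dim=2048, hidden_dim=512)
fused = attn(text, image)  # shape: (2, 12, 512)
```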
Keywords
multimodal deep learning, deep learning, attention-based, vision-language