Can Large Multimodal Models Uncover Deep Semantics Behind Images?

Yixin Yang, Zheng Li,Qingxiu Dong,Heming Xia,Zhifang Sui

CoRR(2024)

引用 0|浏览11
暂无评分
摘要
Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacities of visual deep semantics. DEEPEVAL includes human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision).Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30 understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis indicates that the integration of description texts during the inference process notably enhances LMMs' ability to perceive deep semantics. Furthermore, our dataset is divided into multiple categories, and we conducted a more detailed analysis within these categories.
更多
查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要