Prompting Large Vision-Language Models for Compositional Reasoning
CoRR (2024)
Abstract
Vision-language models such as CLIP have shown impressive capabilities in
encoding texts and images into aligned embeddings, enabling the retrieval of
multimodal data in a shared embedding space. However, these embedding-based
models still face challenges in effectively matching images and texts with
similar visio-linguistic compositionality, as evidenced by their performance on
the recent Winoground dataset. In this paper, we argue that this limitation
stems from two factors: the use of single vector representations for complex
multimodal data, and the absence of step-by-step reasoning in these
embedding-based methods. To address these issues, we take an exploratory step
using a novel generative method that prompts large vision-language models
(e.g., GPT-4) to depict images and perform compositional reasoning. Our method
outperforms other embedding-based methods on the Winoground dataset, and
obtains a further improvement of up to 10% accuracy when enhanced with the
optimal description.
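
Since the abstract only summarizes the approach, the sketch below illustrates the two-step "depict, then reason" idea it describes. This is an illustrative approximation, not the authors' pipeline: the model name `gpt-4-vision-preview`, the prompts, the `describe_image`/`match_score` helpers, and the naive score parsing are all assumptions. The text/image/group metrics follow the definitions from the Winoground paper.

```python
# Hedged sketch: generative matching on a Winoground-style example.
# Step 1 asks a large VLM to depict each image in words; step 2 asks it to
# reason step by step about whether a caption matches that description,
# instead of comparing single CLIP embedding vectors.
from openai import OpenAI  # official openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4-vision-preview"  # assumed model name; any GPT-4-class VLM

def describe_image(image_url: str) -> str:
    """Step 1: prompt the VLM to depict the image (hypothetical prompt)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in detail, including objects, "
                         "their attributes, and the relations between them."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def match_score(description: str, caption: str) -> float:
    """Step 2: compositional reasoning over the description (hypothetical)."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{
            "role": "user",
            "content": f"Image description: {description}\n"
                       f"Caption: {caption}\n"
                       "Reason step by step about whether the caption matches "
                       "the described image, then output only a match score "
                       "from 0 to 100 on the final line.",
        }],
    )
    # Fragile, illustrative parsing: assumes the last line is a bare number.
    return float(resp.choices[0].message.content.strip().splitlines()[-1])

def winoground_scores(c0: str, c1: str, i0_url: str, i1_url: str) -> dict:
    """Winoground metrics for one (C0, I0, C1, I1) example: each caption
    must score higher with its own image than with the swapped one."""
    d0, d1 = describe_image(i0_url), describe_image(i1_url)
    s = {(c, i): match_score(d, c)
         for c in (c0, c1)
         for i, d in ((0, d0), (1, d1))}
    text = s[(c0, 0)] > s[(c1, 0)] and s[(c1, 1)] > s[(c0, 1)]
    image = s[(c0, 0)] > s[(c0, 1)] and s[(c1, 1)] > s[(c1, 0)]
    return {"text": text, "image": image, "group": text and image}
```

An embedding-based baseline such as CLIP would instead define the pair score as the cosine similarity between one text vector and one image vector; the abstract's argument is that such single-vector representations, with no intermediate reasoning step, struggle on exactly the caption/image swaps these metrics probe.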