SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
CoRR (2024)
Abstract
Despite their remarkable successes, state-of-the-art large language models
(LLMs), including vision-and-language models (VLMs) and unimodal language
models (ULMs), fail to understand precise semantics. For example, semantically
equivalent sentences expressed using different lexical compositions elicit
diverging representations. The degree of this divergence and its impact on
encoded semantics is not very well understood. In this paper, we introduce the
SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and
semantic alterations. Each sample in the SUGARCREPE++ dataset consists of an image
and a corresponding triplet of captions: a pair of semantically equivalent but
lexically different positive captions and one hard negative caption. This poses
a 3-way semantic (in)equivalence problem to the language models. We
comprehensively evaluate VLMs and ULMs that differ in architecture,
pre-training objectives, and datasets to benchmark performance on the
SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs
in distinguishing between lexical and semantic variations, particularly in
object attributes and spatial relations. Although VLMs with larger pre-training
datasets, model sizes, and multiple pre-training objectives achieve better
performance on SUGARCREPE++, there is a significant opportunity for
improvement. We show that models which achieve better performance on
compositionality datasets do not necessarily perform equally well on SUGARCREPE++,
indicating that compositionality alone may not be sufficient for understanding
semantic and lexical alterations. Given the importance of the property that the
SUGARCREPE++ dataset targets, it serves as a new challenge to the
vision-and-language community.
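To make the 3-way setup concrete, the sketch below scores one SUGARCREPE++-style triplet (two semantically equivalent positive captions and one hard negative) against an image with an off-the-shelf CLIP model. This is a minimal illustration only, not the paper's evaluation protocol: the checkpoint name, the `score_triplet` helper, and the pass criterion (both positives must outscore the hard negative) are assumptions made for demonstration.

```python
# Hedged sketch: scoring one image-caption triplet with a CLIP-style VLM.
# Assumed criterion: both positive captions must outscore the hard negative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_triplet(image: Image.Image, pos1: str, pos2: str, neg: str) -> dict:
    """Return image-text similarities for the triplet and a pass/fail flag."""
    inputs = processor(text=[pos1, pos2, neg], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    sims = out.logits_per_image.squeeze(0)  # shape (3,): [pos1, pos2, neg]
    return {
        "sim_pos1": sims[0].item(),
        "sim_pos2": sims[1].item(),
        "sim_neg": sims[2].item(),
        # Illustrative criterion: both positives rank above the hard negative.
        "correct": bool(sims[0] > sims[2] and sims[1] > sims[2]),
    }

# Example usage (image path and captions are placeholders):
# result = score_triplet(Image.open("example.jpg"),
#                        "A dog sits on a red couch.",
#                        "A red couch has a dog sitting on it.",
#                        "A cat sits on a red couch.")
```

A model that relies on lexical overlap rather than meaning will tend to score the hard negative (which shares most of its wording with the positives) competitively, which is the failure mode the dataset is designed to expose.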