Parrot Captions Teach CLIP to Spot Text
CoRR(2023)
摘要
Despite CLIP being the foundation model in numerous vision-language
applications, the CLIP suffers from a severe text spotting bias. Such bias
causes CLIP models to `Parrot' the visual text embedded within images while
disregarding the authentic visual semantics. We uncover that in the most
popular image-text dataset LAION-2B, the captions also densely parrot (spell)
the text embedded in images. Our analysis shows that around \textbf{50\%} of
images are embedded with visual text content, and \textbf{90\%} of their
captions more or less parrot the visual text. Based on such observation, we
thoroughly inspect the different release d versions of CLIP models and verify
that the visual text is the dominant factor in measuring the LAION-style
image-text similarity for these models. To examine whether these parrot
captions shape the text spotting bias, we train a series of CLIP models with
LAION subsets curated by different parrot-caption-oriented criteria. We show
that training with parrot captions easily shapes such bias but harms the
expected visual-language representation learning in CLIP models. This suggests
that it is urgent to revisit either the design of CLIP-like models or the
existing image-text dataset curation pipeline built on CLIP score filtering.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要