Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training
arXiv (2024)
Abstract
Multimodal pre-training, which learns medical visual representations from
images paired with medical reports, has demonstrated its potential in the medical domain.
However, many pre-training tasks require extra annotations from clinicians, and
most of them fail to explicitly guide the model to learn the desired features
of different pathologies. To the best of our knowledge, we are the first to
utilize Visual Question Answering (VQA) for multimodal pre-training to guide
the framework to focus on targeted pathological features. In this work, we
leverage descriptions in medical reports to design multi-granular
question-answer pairs associated with different diseases, which assist the
framework in pre-training without requiring extra annotations from experts. We
also propose a novel pre-training framework with a quasi-textual feature
transformer, a module designed to transform visual features into a
quasi-textual space closer to the textual domain via a contrastive learning
strategy. This narrows the vision-language gap and facilitates modality
alignment. Our framework is applied to four downstream tasks: report
generation, classification, segmentation, and detection across five datasets.
Extensive experiments demonstrate the superiority of our framework compared to
other state-of-the-art methods. Our code will be released upon acceptance.
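
As a rough illustration of how question-answer pairs might be mined from report text without expert annotation, the sketch below templates coarse (presence/absence) and fine (descriptive) questions from sentences that mention a disease. The disease vocabulary, negation cues, and question templates are illustrative assumptions, not the paper's actual multi-granular design.

```python
import re

# Hypothetical disease vocabulary and negation cues; the paper's actual
# templates and granularity levels are not specified in the abstract.
DISEASES = ["effusion", "pneumothorax", "cardiomegaly", "consolidation"]
NEGATIONS = ("no ", "without ", "negative for ")

def build_qa_pairs(report: str):
    """Mine coarse- and fine-grained QA pairs from one report string."""
    pairs = []
    sentences = [s.strip() for s in re.split(r"[.;]", report) if s.strip()]
    for disease in DISEASES:
        hits = [s for s in sentences if disease in s.lower()]
        positive = [s for s in hits
                    if not any(neg in s.lower() for neg in NEGATIONS)]
        # Coarse granularity: presence or absence of the pathology.
        pairs.append((f"Is there evidence of {disease}?",
                      "yes" if positive else "no"))
        # Fine granularity: the describing sentence itself is the answer,
        # steering the model toward that pathology's description.
        for sent in hits:
            pairs.append((f"Describe the {disease} finding.", sent))
    return pairs

if __name__ == "__main__":
    report = "Small left pleural effusion. No pneumothorax. Heart size is normal."
    for question, answer in build_qa_pairs(report):
        print(question, "->", answer)
```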
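The abstract also describes a quasi-textual feature transformer trained with a contrastive objective to pull visual features toward the textual space. The minimal sketch below assumes a small TransformerEncoder over patch features and a symmetric InfoNCE loss over pooled image/report embeddings; the dimensions, pooling, and module details are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuasiTextualTransformer(nn.Module):
    """Assumed module: maps visual patch features into a quasi-textual space."""
    def __init__(self, dim: int = 512, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) image features.
        x = self.encoder(visual_tokens)
        return self.proj(x.mean(dim=1))  # pooled quasi-textual embedding

def contrastive_loss(quasi_text: torch.Tensor, text: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched image/report pairs act as positives."""
    q = F.normalize(quasi_text, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = q @ t.t() / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

if __name__ == "__main__":
    module = QuasiTextualTransformer()
    visual = torch.randn(4, 49, 512)   # e.g. a 7x7 patch grid of features
    text = torch.randn(4, 512)         # pooled report embeddings
    print(contrastive_loss(module(visual), text).item())
```

Aligning the transformed (quasi-textual) features with report embeddings, rather than raw visual features, is what the abstract credits with narrowing the vision-language gap.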