A Survivor in the Era of Large-Scale Pretraining: An Empirical Study of One-Stage Referring Expression Comprehension

IEEE Transactions on Multimedia (2024)

Abstract
One-stage Referring Expression Comprehension (REC) is a task that requires accurate alignment between text descriptions and visual content. In recent years, numerous efforts have been devoted to cross-modal learning for REC, while the influence of other factors on this task has not been studied systematically. To fill this gap, we conduct an empirical study in this article. Concretely, we ablate 42 candidate designs/settings based on a common REC framework; these candidates cover the entire process of one-stage REC, from network design to model training. We then conduct over 100 experimental trials on three REC benchmark datasets. The experimental results reveal the key factors that affect REC performance in addition to multi-modal fusion, e.g., multi-scale features and data augmentation. Based on these findings, we further propose a simple yet strong model called SimREC, which achieves new state-of-the-art performance on these benchmarks. Beyond these results, we also find that, with much lower training overhead and fewer parameters, SimREC can achieve better performance than a set of large-scale pre-trained models, e.g., UNITER and VILLA, which highlights the special role of REC in existing V&L research.
Keywords
Task analysis,Visualization,Training,Head,Cognition,Systematics,Sun,Computer vision,object recognition