One-Stream Vision-Language Memory Network for Object Tracking

IEEE TRANSACTIONS ON MULTIMEDIA (2024)

Abstract
Most existing tracking methods try to represent the target by exploiting as much visual information as possible with various deep networks. However, an appearance model alone can hardly describe the target's attributes well, which causes trackers to fail in complex visual surroundings. In this article, inspired by brain-like intelligence, we propose a One-stream Vision-Language Memory network (OVLM) for object tracking. First, we combine vision and language to build the target model, using the semantic information in language to compensate for the instability of visual information; this makes the target model more stable under complex appearance changes. Second, to build a more compact target model, we propose a memory token selection mechanism that uses linguistic information to eliminate tokens that carry no target information. Furthermore, to provide better visual information for target modeling, we propose a language-based evaluation method that selects high-quality target samples to store in the memory. Finally, OVLM achieves a 64.7% success rate on the large-scale tracking benchmark TNL2K, outperforming the previous best result (VLT) by 11.6%. By demonstrating the potential of the vision-language memory network, we aim to draw greater attention to it and open up new avenues for vision-language tracking.
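The abstract's memory token selection mechanism, which uses linguistic information to discard tokens that carry no target information, can be illustrated with a minimal sketch. The paper does not specify the scoring rule; this toy version assumes cosine similarity between each vision token and a pooled language embedding, keeping only the top-scoring fraction. The function name, `keep_ratio` parameter, and similarity measure are illustrative assumptions, not the authors' actual design.

```python
import numpy as np

def select_memory_tokens(vision_tokens, lang_embed, keep_ratio=0.5):
    """Illustrative token selection: keep the vision tokens most
    similar to a pooled language embedding (assumed scoring rule).

    vision_tokens: (N, D) array of token features.
    lang_embed:    (D,) pooled language feature.
    Returns the retained tokens and their (sorted) indices.
    """
    # Normalize so the dot product is cosine similarity.
    v = vision_tokens / np.linalg.norm(vision_tokens, axis=1, keepdims=True)
    l = lang_embed / np.linalg.norm(lang_embed)
    scores = v @ l  # one relevance score per token

    # Retain the top-k tokens judged most target-relevant by language.
    k = max(1, int(round(len(scores) * keep_ratio)))
    idx = np.argsort(scores)[::-1][:k]
    return vision_tokens[idx], np.sort(idx)

tokens = np.array([[1.0, 0.0],   # aligned with the language cue
                   [0.5, 0.5],   # partially aligned
                   [0.0, 1.0]])  # unrelated background token
kept, idx = select_memory_tokens(tokens, np.array([1.0, 0.0]), keep_ratio=0.67)
print(idx)  # the background token (index 2) is pruned
```

In a real one-stream tracker the retained tokens would then be the only ones propagated into the memory, which is what makes the stored target model more compact.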
Keywords
Target tracking, Visualization, Linguistics, Feature extraction, Computational modeling, Adaptation models, Object tracking, vision-language, one-stream, memory network