Context-Aware Integration of Language and Visual References for Natural Language Tracking
arXiv (2024)
Abstract
Tracking by natural language specification (TNL) aims to consistently
localize a target in a video sequence given a linguistic description in the
initial frame. Existing methods perform language-based and template-based
matching for target reasoning separately and then merge the matching results
from the two sources. They therefore suffer from tracking drift when the
language and visual templates misalign with the dynamic target state, and from
ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking
framework with 1) a prompt modulation module to leverage the complementarity
between temporal visual templates and language expressions, enabling precise
and context-aware appearance and linguistic cues, and 2) a unified target
decoding module to integrate the multi-modal reference cues and execute the
integrated queries on the search image to predict the target location directly
in an end-to-end manner. This design ensures spatio-temporal consistency by
leveraging historical visual information and introduces an integrated solution,
generating predictions in a single step. Extensive experiments conducted on
TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed
approach. The results demonstrate competitive performance against
state-of-the-art methods for both tracking and grounding.
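
The two-stage idea described above (cross-modal prompt modulation followed by unified decoding on the search image) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `modulate_prompts` and `decode_target`, the use of plain single-head attention without learned projections, the mean pooling of queries, and all tensor sizes are assumptions made purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    # Single-head scaled dot-product attention, no learned weights (assumption).
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def modulate_prompts(lang_tokens, template_tokens):
    # Hypothetical stand-in for a prompt modulation step: each modality is
    # refined by a residual cross-attention over the other modality.
    lang_ctx = lang_tokens + cross_attend(lang_tokens, template_tokens)
    vis_ctx = template_tokens + cross_attend(template_tokens, lang_tokens)
    return lang_ctx, vis_ctx

def decode_target(lang_ctx, vis_ctx, search_tokens):
    # Hypothetical stand-in for unified decoding: both reference cues are
    # concatenated, pooled into one target query, and scored against every
    # search-image token in a single step.
    queries = np.concatenate([lang_ctx, vis_ctx], axis=0)
    target_query = queries.mean(axis=0)
    scores = search_tokens @ target_query / np.sqrt(search_tokens.shape[-1])
    return softmax(scores)

rng = np.random.default_rng(0)
d = 16                                  # feature width (illustrative)
lang = rng.normal(size=(5, d))          # 5 language tokens
templates = rng.normal(size=(8, d))     # 8 temporal visual template tokens
search = rng.normal(size=(36, d))       # 6x6 grid of search-image tokens

lang_ctx, vis_ctx = modulate_prompts(lang, templates)
heatmap = decode_target(lang_ctx, vis_ctx, search).reshape(6, 6)
print(heatmap.shape, np.unravel_index(heatmap.argmax(), heatmap.shape))
```

The point of the sketch is the data flow: the language and visual references are refined against each other first, and a single decoding pass over the search tokens then produces the localization map, rather than two separate matching results being merged afterwards.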