Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, in which the target is specified by a first-frame mask-level or language-level label, respectively. Due to the large gap between the modalities, existing works have never proposed a unified framework covering both sub-tasks. In this work, such a unified framework is designed: the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into same-size tokens, which are processed equally by a self-attention based segmentation model. Furthermore, to highlight the significant information and discard the non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize them. Experiments show that the proposed method achieves new state-of-the-art performance: 52.8% J&F on the Ref-YouTube-VOS dataset and 83.2% JS on the YouTube-VOS dataset, respectively. The code will be released.
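The core idea of the unified framework is that image patches and words, once projected into same-size tokens, can be processed jointly by a single self-attention model. The following is a minimal numpy sketch of that tokenization-and-attention step; all function names, dimensions, and the random projections are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_tokens(frame, patch=16, dim=64):
    # Split an H x W x 3 frame into non-overlapping patches,
    # flatten each patch, and linearly project to a dim-size token.
    H, W, C = frame.shape
    p = frame.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # toy projection
    return p @ W_proj  # (num_patches, dim)

def word_tokens(word_ids, vocab=1000, dim=64):
    # Map word ids to embeddings of the SAME size as the patch tokens.
    emb = rng.standard_normal((vocab, dim)) * 0.02  # toy embedding table
    return emb[word_ids]  # (num_words, dim)

def self_attention(x):
    # Single-head attention with Q = K = V = x, for brevity.
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

frame = rng.standard_normal((64, 64, 3))   # one video frame
words = np.array([5, 17, 42])              # a tokenized referring expression
tokens = np.concatenate([patch_tokens(frame), word_tokens(words)], axis=0)
out = self_attention(tokens)               # (16 + 3, 64): patches and words attended jointly
```

Because both modalities end up as rows of the same token matrix, the attention step makes no distinction between visual and linguistic inputs, which is what lets one model serve both the mask-referred and language-referred sub-tasks.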
Keywords
Video Object Segmentation, Multiple Modalities, Reinforcement Learning