Unified Multi-Modality Video Object Segmentation Using Reinforcement Learning

IEEE Transactions on Circuits and Systems for Video Technology (2023)

Abstract
The main task we aim to tackle is multi-modality video object segmentation (VOS), which can be divided into two sub-tasks: mask-referred and language-referred VOS, in which the target is specified by a first-frame mask-level or language-level label, respectively. Due to the large gap between the modalities, existing works have never proposed a unified framework covering both sub-tasks. In this work, such a unified framework is designed: the visual and linguistic inputs are first split into a number of image patches and words, and then mapped into same-size tokens, which are processed equally by a self-attention based segmentation model. Furthermore, to highlight the significant information and discard the non-target or ambiguous information, unified multi-modality filter networks are designed, and reinforcement learning is adopted to optimize them. Experiments show that the proposed method achieves new state-of-the-art performance: 52.8% J&F on the Ref-YouTube-VOS dataset and 83.2% JS on the YouTube-VOS dataset, respectively. The code will be released.
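The core idea of the unified framework is that image patches and words, once projected into same-size tokens, can be processed jointly by a single self-attention model. The following is a minimal numpy sketch of that tokenization-and-attention step; all function names, dimensions, and the random projections are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_tokens(frame, patch=16, dim=64):
    # Split an H x W x 3 frame into non-overlapping patches,
    # flatten each patch, and linearly project to a dim-size token.
    H, W, C = frame.shape
    p = frame.reshape(H // patch, patch, W // patch, patch, C)
    p = p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # toy projection
    return p @ W_proj  # (num_patches, dim)

def word_tokens(word_ids, vocab=1000, dim=64):
    # Map word ids to embeddings of the SAME size as the patch tokens.
    emb = rng.standard_normal((vocab, dim)) * 0.02  # toy embedding table
    return emb[word_ids]  # (num_words, dim)

def self_attention(x):
    # Single-head attention with Q = K = V = x, for brevity.
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

frame = rng.standard_normal((64, 64, 3))   # one video frame
words = np.array([5, 17, 42])              # a tokenized referring expression
tokens = np.concatenate([patch_tokens(frame), word_tokens(words)], axis=0)
out = self_attention(tokens)               # (16 + 3, 64): patches and words attended jointly
```

Because both modalities end up as rows of the same token matrix, the attention step makes no distinction between visual and linguistic inputs, which is what lets one model serve both the mask-referred and language-referred sub-tasks.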
Keywords
Video Object Segmentation, Multiple Modalities, Reinforcement Learning