Dual DETRs for Multi-Label Temporal Action Detection
CVPR 2024
Abstract
Temporal Action Detection (TAD) aims to identify action boundaries and
their corresponding categories within untrimmed videos. Inspired by the success of
DETR in object detection, several methods have adapted the query-based
framework to the TAD task. However, these approaches primarily followed DETR to
predict actions at the instance level (i.e., identify each action by its center
point), leading to sub-optimal boundary localization. To address this issue, we
propose a new dual-level query-based TAD framework, namely DualDETR, to detect
actions at both the instance level and the boundary level. Decoding at different
levels requires semantics of different granularity, so we introduce a
two-branch decoding structure. This structure builds distinctive decoding
processes for different levels, facilitating explicit capture of temporal cues
and semantics at each level. On top of the two-branch design, we present a
joint query initialization strategy to align queries from both levels.
Specifically, we leverage encoder proposals to match queries from each level in
a one-to-one manner. Then, the matched queries are initialized using position
and content priors from the matched action proposal. The aligned dual-level
queries can refine the matched proposal with complementary cues during
subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD
benchmarks. The experimental results show that DualDETR outperforms
existing state-of-the-art methods, achieving a substantial
improvement under det-mAP and delivering impressive results under seg-mAP.
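
The joint query initialization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `joint_query_init`, the `(start, end, score)` proposal format, the dictionary keys, and the choice to take the content prior from the nearest encoder feature are all assumptions made for illustration. Each selected proposal yields one instance-level query (center, duration) and a pair of boundary-level queries (start, end), so the two levels are aligned one-to-one through the shared proposal.

```python
from typing import Dict, List, Tuple

# Hypothetical proposal format: (start, end, score), times normalized to [0, 1].
Proposal = Tuple[float, float, float]


def joint_query_init(
    proposals: List[Proposal],
    features: List[List[float]],
    num_queries: int,
) -> List[Dict[str, object]]:
    """Derive aligned instance-level and boundary-level queries from proposals.

    Assumption: the content prior is the encoder feature nearest to the
    query position; the abstract does not specify how it is obtained.
    """
    num_steps = len(features)

    def feature_at(t: float) -> List[float]:
        # Map a normalized time to the nearest temporal index, clamped to range.
        index = min(num_steps - 1, max(0, round(t * (num_steps - 1))))
        return features[index]

    # Keep the highest-scoring proposals, one per query slot.
    top = sorted(proposals, key=lambda p: p[2], reverse=True)[:num_queries]

    queries = []
    for start, end, _score in top:
        center = (start + end) / 2
        queries.append({
            # Instance level: position prior is (center, duration).
            "instance_pos": (center, end - start),
            "instance_content": feature_at(center),
            # Boundary level: one query per boundary of the same proposal.
            "start_pos": start,
            "start_content": feature_at(start),
            "end_pos": end,
            "end_content": feature_at(end),
        })
    return queries
```

Because both levels are initialized from the same proposal, a decoder built on this sketch could refine the center and duration in one branch and the start and end points in the other while keeping the queries paired.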