EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving
CoRR (2024)
Abstract
This paper introduces the task of Auditory Referring Multi-Object Tracking
(AR-MOT), which dynamically tracks specific objects in a video sequence based
on audio expressions and poses a challenging problem for autonomous driving.
Owing to the difficulty of modeling semantics jointly across audio and video,
existing works have mainly focused on text-based multi-object tracking, which
often comes at the cost of tracking quality, interaction efficiency, and even
the safety of assistance systems, limiting the application of such methods in
autonomous driving. In this paper, we delve into the problem of AR-MOT from the
perspective of audio-video fusion and audio-video tracking. We put forward
EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers.
The dual streams are intertwined with our Bidirectional Frequency-domain
Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and
video features in both the frequency and spatiotemporal domains (a minimal
sketch follows the abstract). Moreover, we propose the Audio-visual Contrastive
Tracking Learning (ACTL) regime, which effectively extracts semantically
homogeneous features shared between audio expressions and the visual objects
they refer to (also sketched below). Aside from the architectural design,
we establish the first set of
large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD.
Extensive experiments on the established benchmarks demonstrate the
effectiveness of the proposed EchoTrack model and its components. The source
code and datasets will be made publicly available at
https://github.com/lab206/EchoTrack.
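To make the fusion idea concrete, below is a minimal PyTorch sketch of a bidirectional frequency-domain cross-attention block in the spirit of Bi-FCFM. The class name, layer sizes, the crude low-pass filter, and the residual/normalization layout are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BiFreqCrossAttentionFusion(nn.Module):
    """Sketch of bidirectional frequency-domain cross-attention fusion."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # One cross-attention block per fusion direction.
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_audio = nn.LayerNorm(dim)
        self.norm_video = nn.LayerNorm(dim)

    @staticmethod
    def _low_pass(x: torch.Tensor) -> torch.Tensor:
        # Project tokens into the frequency domain along the sequence axis
        # and drop the upper half of the spectrum -- a crude stand-in for
        # whatever filtering the real Bi-FCFM performs.
        freq = torch.fft.rfft(x, dim=1)
        keep = torch.arange(freq.size(1), device=x.device) < max(1, freq.size(1) // 2)
        freq = freq * keep.view(1, -1, 1).to(freq.dtype)
        return torch.fft.irfft(freq, n=x.size(1), dim=1)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio: (B, Ta, C) expression tokens; video: (B, Tv, C) frame tokens.
        a_f, v_f = self._low_pass(audio), self._low_pass(video)
        # Each stream queries the other, so information flows both ways.
        video_fused, _ = self.audio_to_video(v_f, a_f, a_f)
        audio_fused, _ = self.video_to_audio(a_f, v_f, v_f)
        return (self.norm_audio(audio + audio_fused),
                self.norm_video(video + video_fused))
```

For example, fusing audio tokens of shape (2, 20, 256) with video tokens of shape (2, 900, 256) returns two tensors of those same respective shapes, one refined stream per modality.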
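Similarly, one plausible reading of the ACTL objective is a symmetric InfoNCE-style contrastive loss over matched audio-expression/object embedding pairs. The function below is a sketch under that assumption; the symmetric form and the temperature value are conventional choices, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F


def audio_visual_contrastive_loss(audio_emb: torch.Tensor,
                                  object_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    # audio_emb:  (N, C), one embedding per referring audio expression.
    # object_emb: (N, C), embedding of the visual object each expression
    # refers to; row i of both tensors forms the positive pair, and every
    # other row in the batch acts as a negative.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(object_emb, dim=-1)
    logits = a @ v.t() / temperature                    # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    # Symmetric cross-entropy over both retrieval directions
    # (audio -> object and object -> audio).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Minimizing this loss pulls each expression embedding toward the object it refers to while pushing it away from the other objects in the batch, which matches the abstract's goal of learning semantically homogeneous audio-visual features.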