Multi-Modal Transformer with Skeleton and Text for Action Recognition
2024 International Joint Conference on Neural Networks (IJCNN)
Abstract
Dynamic skeleton data, represented as the 2D/3D coordinates of human joints, has been widely used for human action recognition because of its high-level semantic information and robustness to environmental variation. However, most previous methods rely on skeleton data alone and overlook the crucial role of text in helping machines understand visual content. This paper proposes a novel method based on a multi-modal Transformer with skeleton and text (MMT-ST) for action recognition. The proposed method performs action captioning and action recognition simultaneously, dynamically updating the recognition result based on the captioning output. MMT-ST employs a Transformer backbone and consists of four components: two single-modal encoders, a cross encoder, and a decoder. The single-modal encoders embed skeletons and texts, respectively. The cross encoder learns the underlying correlations between the two modalities and performs the action recognition task through a classification head, while the decoder carries out the action captioning task. Additionally, a two-stage training strategy is adopted to make model training smoother. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and ETRI-Activity3D datasets demonstrate the effectiveness of the proposed method.
Keywords
multi-modal, skeleton sequences, text-supervised, action recognition
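The abstract describes the four-component layout (two single-modal encoders, a cross encoder, a decoder with recognition and captioning heads) without implementation detail. Below is a minimal PyTorch sketch of one plausible arrangement. All module names, dimensions, the concatenation-based fusion, and the mean-pooled classification input are assumptions made for illustration; this is not the authors' released code.

```python
import torch
import torch.nn as nn

class MMTST(nn.Module):
    """Hypothetical sketch of the MMT-ST layout: skeleton encoder, text
    encoder, cross encoder with a classification head, and a caption decoder."""

    def __init__(self, num_joints=25, joint_dim=3, vocab_size=30522,
                 d_model=256, nhead=8, num_layers=2, num_classes=60):
        super().__init__()
        # Single-modal encoders: one per modality (dimensions are assumed).
        self.skel_proj = nn.Linear(num_joints * joint_dim, d_model)
        self.skel_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        # Cross encoder: learns skeleton-text correlations; concatenating the
        # two token sequences is one possible fusion choice, not confirmed by
        # the paper.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.cls_head = nn.Linear(d_model, num_classes)   # recognition task
        # Decoder: generates the action caption autoregressively.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers)
        self.caption_head = nn.Linear(d_model, vocab_size)

    def forward(self, skeletons, text_ids, caption_ids):
        # skeletons: (B, T, num_joints * joint_dim); text_ids/caption_ids: (B, L)
        s = self.skel_encoder(self.skel_proj(skeletons))
        t = self.text_encoder(self.text_embed(text_ids))
        fused = self.cross_encoder(torch.cat([s, t], dim=1))
        logits = self.cls_head(fused.mean(dim=1))         # action recognition
        # Causal mask so each caption position only attends to earlier ones.
        mask = nn.Transformer.generate_square_subsequent_mask(
            caption_ids.size(1)).to(skeletons.device)
        cap = self.decoder(self.text_embed(caption_ids), fused, tgt_mask=mask)
        return logits, self.caption_head(cap)             # captioning task

# Shape check with dummy inputs (batch of 2 clips, 64 frames, 25 joints).
model = MMTST()
skel = torch.randn(2, 64, 25 * 3)
txt = torch.randint(0, 30522, (2, 12))   # text prompt tokens
cap = torch.randint(0, 30522, (2, 16))   # shifted caption tokens
logits, caption_logits = model(skel, txt, cap)
```

Under the two-stage strategy mentioned in the abstract, one would presumably pre-train the encoders and captioning branch before fine-tuning the recognition head jointly, though the exact schedule is not specified here.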