
Multi-Modal Transformer with Skeleton and Text for Action Recognition

Lijuan Zhou, Xuri Jiao

2024 International Joint Conference on Neural Networks (IJCNN)

Abstract
Dynamic skeleton data, represented as the 2D/3D coordinates of human joints, has been widely used for human action recognition due to its high-level semantic information and robustness to environmental variations. However, most previous methods used skeleton data alone, overlooking the crucial role of text information in helping machines understand visual content. This paper proposes a novel method based on a multi-modal Transformer with skeleton and text (MMT-ST) for action recognition. The proposed method performs the action captioning and recognition tasks simultaneously, dynamically updating action recognition based on the results of action captioning. MMT-ST employs a Transformer as its backbone and consists of four components: two single-modal encoders, a cross encoder, and a decoder. The single-modal encoders embed skeletons and texts, respectively. The cross encoder learns the underlying correlations between the two modalities and performs the action recognition task through a classification head. The decoder conducts the action captioning task. Additionally, a two-stage training strategy is employed to ensure smoother model training. Extensive experiments conducted on the NTU RGB+D, NTU RGB+D 120, and ETRI-Activity3D datasets demonstrate the effectiveness of the proposed method.
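
The sketch below illustrates the four-component layout described in the abstract (two single-modal encoders, a cross encoder feeding a classification head, and a captioning decoder). It assumes PyTorch; the module names, layer sizes, joint count, pooling choice, and the omission of attention masks are all illustrative placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MMT_ST_Sketch(nn.Module):
    """Hypothetical skeleton of the MMT-ST structure; dimensions are assumptions."""

    def __init__(self, num_classes, vocab_size, d_model=256, nhead=8, num_layers=2,
                 num_joints=25, coord_dim=3):
        super().__init__()
        # Single-modal encoders: one for skeleton joint coordinates, one for text tokens.
        self.skeleton_proj = nn.Linear(num_joints * coord_dim, d_model)  # per-frame joint coords (assumed layout)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.skeleton_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Cross encoder: learns cross-modal correlations; a classification head does recognition.
        self.cross_encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.cls_head = nn.Linear(d_model, num_classes)
        # Decoder: generates the action caption conditioned on the fused features.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.caption_head = nn.Linear(d_model, vocab_size)

    def forward(self, skeletons, text_tokens, caption_tokens):
        # skeletons: (batch, frames, num_joints*coord_dim); text/caption tokens: (batch, seq)
        s = self.skeleton_encoder(self.skeleton_proj(skeletons))
        t = self.text_encoder(self.text_embed(text_tokens))
        fused = self.cross_encoder(torch.cat([s, t], dim=1))     # cross-modal fusion
        action_logits = self.cls_head(fused.mean(dim=1))         # action recognition task
        cap = self.decoder(self.text_embed(caption_tokens), fused)  # causal mask omitted for brevity
        caption_logits = self.caption_head(cap)                  # action captioning task
        return action_logits, caption_logits
```

In this reading, the recognition head consumes the fused skeleton-text representation while the decoder reuses it as memory for captioning, which is one plausible way the two tasks could share features; the paper's actual fusion and two-stage training details are not reproduced here.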
Keywords
multi-modal, skeleton sequences, text-supervised, action recognition