CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner
arXiv (2024)
Abstract
Most existing one-shot skeleton-based action recognition methods focus on raw
low-level information (e.g., joint location), and may suffer from local
information loss and low generalization ability. To alleviate these issues, we propose
to leverage text descriptions generated by large language models (LLMs), which
contain high-level human knowledge, to guide feature learning in a
global-local-global way. Particularly, during training, we design two prompts
to gain global and local text descriptions of each action from an LLM. We first
utilize the global text description to guide the skeleton encoder to focus on
informative joints (i.e., global-to-local). Then we build non-local interaction
between local text and joint features, to form the final global representation
(i.e., local-to-global). To mitigate the asymmetry issue between the training
and inference phases, we further design a dual-branch architecture that allows
the model to perform novel class inference without any text input, also making
the additional inference cost negligible compared with the base skeleton
encoder. Extensive experiments on three different benchmarks show that CrossGLG
consistently outperforms the existing SOTA methods by large margins, and its
inference cost (model size) is only 2.8% of that of the previous SOTA. CrossGLG
can also serve as a plug-and-play module that can substantially enhance the
performance of different SOTA skeleton encoders at a negligible cost during
inference. The source code will be released soon.
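
The global-to-local and local-to-global steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which had not been released at the time of writing. It assumes that global guidance can be approximated by a softmax weighting of joints against a global text embedding, and that the non-local interaction can be approximated by scaled dot-product cross-attention from joints to local text features. The function names `global_to_local` and `local_to_global`, the feature dimensions, and the random inputs are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_to_local(joint_feats, global_text):
    """Assumed form of global guidance: weight each joint by its
    similarity to the global text embedding.

    joint_feats: (J, D) per-joint features; global_text: (D,) embedding.
    Returns the reweighted joints (J, D) and the joint weights (J,).
    """
    d = joint_feats.shape[-1]
    scores = joint_feats @ global_text / np.sqrt(d)  # (J,)
    weights = softmax(scores)                        # sums to 1 over joints
    return joint_feats * weights[:, None], weights


def local_to_global(joint_feats, local_text):
    """Assumed form of the non-local interaction: every joint attends to
    every local text feature, then joints are mean-pooled into one vector.

    joint_feats: (J, D); local_text: (P, D) local description features.
    Returns a single global representation of shape (D,).
    """
    d = joint_feats.shape[-1]
    attn = softmax(joint_feats @ local_text.T / np.sqrt(d), axis=-1)  # (J, P)
    fused = joint_feats + attn @ local_text  # residual fusion, (J, D)
    return fused.mean(axis=0)


# Hypothetical sizes: 25 joints, 6 local text features, 16-dim embeddings.
rng = np.random.default_rng(0)
J, P, D = 25, 6, 16
joints = rng.normal(size=(J, D))
g_text = rng.normal(size=D)
l_text = rng.normal(size=(P, D))

guided, w = global_to_local(joints, g_text)
rep = local_to_global(guided, l_text)
print(rep.shape, w.shape)
```

The sketch covers only the text-guided training path. The dual-branch design that removes the need for text at inference time is not modeled here, since the abstract does not specify how the two branches are coupled.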