
Holistic-Guided Disentangled Learning With Cross-Video Semantics Mining for Concurrent First-Person and Third-Person Activity Recognition

IEEE Transactions on Neural Networks and Learning Systems (2024)

Abstract
The popularity of wearable devices has increased demand for research on first-person activity recognition. However, most existing first-person activity datasets are built on the assumption that only human-object interaction (HOI) activities performed by the camera-wearer are captured in the field of view. Since humans live in complex environments, third-person activities performed by other people are also likely to appear alongside first-person activities. Analyzing and recognizing these two types of activities as they occur simultaneously in a scene is important for the camera-wearer to understand the surrounding environment. To facilitate research on concurrent first- and third-person activity recognition (CFT-AR), we first created a new activity dataset, PolyU Concurrent First- and Third-person (CFT) Daily, which exhibits distinct properties and challenges compared with previous activity datasets. Since temporal asynchronism and an appearance gap usually exist between first- and third-person activities, it is crucial to learn robust representations from all activity-related spatio-temporal positions. We therefore explore both holistic scene-level and local instance-level (person-level) features to provide comprehensive and discriminative patterns for recognizing both first- and third-person activities. On the one hand, the holistic scene-level features are extracted by a 3-D convolutional neural network trained to mine shared and sample-unique semantics between video pairs via two well-designed attention-based modules and a self-knowledge distillation (SKD) strategy. On the other hand, we further leverage the extracted holistic features to guide the learning of instance-level features in a disentangled fashion, aiming to discover both spatially conspicuous patterns and temporally varied, yet critical, cues. Experimental results on the PolyU CFT Daily dataset validate that our method achieves state-of-the-art performance.
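As a rough illustration of how the abstract's two-level design might fit together, the PyTorch sketch below pairs a holistic scene-level 3-D CNN branch with an instance-level branch whose features are gated by the holistic cues, and distills the instance-branch predictions toward the holistic-branch ones with a softened-KL self-knowledge distillation term. All module names, tensor shapes, and the gating mechanism (HolisticBranch, InstanceBranch, skd_loss) are illustrative assumptions for exposition, not the authors' actual architecture.

```python
# Hypothetical sketch of the two-level pipeline described in the abstract:
# a holistic scene-level 3-D CNN branch, an instance-level branch guided by
# the holistic features, and a self-knowledge distillation (SKD) loss.
# Shapes and module designs are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HolisticBranch(nn.Module):
    """Scene-level 3-D CNN backbone (stand-in for the paper's backbone)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),      # pool over time and space
        )

    def forward(self, video):             # video: (B, 3, T, H, W)
        return self.conv(video).flatten(1)  # holistic feature: (B, dim)


class InstanceBranch(nn.Module):
    """Person-level features whose learning is guided by holistic cues."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)    # holistic-conditioned channel gate

    def forward(self, inst_feats, holistic):  # inst_feats: (B, N, dim)
        gate = torch.sigmoid(self.gate(holistic)).unsqueeze(1)  # (B, 1, dim)
        guided = self.proj(inst_feats) * gate  # emphasize holistic-relevant dims
        return guided.mean(dim=1)              # aggregate over N persons


def skd_loss(student_logits, teacher_logits, tau=2.0):
    """Self-knowledge distillation: match softened class distributions."""
    p_t = F.softmax(teacher_logits.detach() / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau


if __name__ == "__main__":
    B, N, dim, classes = 2, 4, 256, 10
    video = torch.randn(B, 3, 8, 64, 64)   # short clip per sample
    inst = torch.randn(B, N, dim)          # N detected persons per clip
    holistic = HolisticBranch(dim)(video)
    fused = InstanceBranch(dim)(inst, holistic)
    head = nn.Linear(dim, classes)         # shared classifier head
    logits_h, logits_i = head(holistic), head(fused)
    loss = skd_loss(logits_i, logits_h)    # instance branch distills from holistic
    print(loss.item())
```

In this reading, the gate plays the role of "holistic-guided" learning (the scene context decides which instance-feature channels matter), while the SKD term keeps the two branches' predictions consistent; the paper's attention-based cross-video semantics mining modules are omitted here for brevity.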
Keywords
Feature extraction, Activity recognition, Semantics, Task analysis, Training, Hardware design languages, Aggregates, Concurrent first- and third-person activity recognition (CFT-AR), cross-video semantics mining (CVSM), holistic-guided disentangled learning (HDL), self-knowledge distillation (SKD)