EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
arXiv (2023)
Abstract
Facial Expression Recognition (FER) is a crucial task in affective computing,
but its conventional focus on the seven basic emotions limits its applicability
to the complex and expanding emotional spectrum. To address the issue of new
and unseen emotions present in dynamic in-the-wild FER, we propose a novel
vision-language model that utilises sample-level text descriptions (i.e.
captions of the context, expressions or emotional cues) as natural language
supervision, aiming to enhance the learning of rich latent representations for
zero-shot classification. To test this, we evaluate the model trained on
sample-level descriptions via zero-shot classification on four popular dynamic
FER datasets. Our findings show that this approach yields
significant improvements when compared to baseline methods. Specifically, for
zero-shot video FER, we outperform CLIP by over 10% in terms of Weighted
Average Recall and 5% in terms of Unweighted Average Recall on several
datasets. Furthermore, we evaluate the representations obtained from the
network trained using sample-level descriptions on the downstream task of
mental health symptom estimation, achieving performance comparable to or better
than state-of-the-art methods and strong agreement with human experts. In particular, we
achieve a Pearson's Correlation Coefficient of up to 0.85 on schizophrenia
symptom severity estimation, which is comparable to human experts' agreement.
The code is publicly available at: https://github.com/NickyFot/EmoCLIP.
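To make the zero-shot protocol concrete, the sketch below illustrates CLIP-style zero-shot classification over video frames. It is a minimal illustration under stated assumptions, not the EmoCLIP implementation: the checkpoint name, class prompts, and frame paths are placeholders, EmoCLIP trains with richer sample-level descriptions, and frame embeddings are simply mean-pooled here to form a video embedding.

```python
# Minimal sketch of CLIP-style zero-shot expression classification.
# Assumptions: a stock CLIP checkpoint, hypothetical class prompts and
# frame paths; EmoCLIP itself uses sample-level descriptions in training.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical class prompts for the seven basic emotions.
class_prompts = [
    "an expression of happiness",
    "an expression of sadness",
    "an expression of anger",
    "an expression of fear",
    "an expression of disgust",
    "an expression of surprise",
    "a neutral expression",
]

# Placeholder frame paths standing in for a decoded video clip.
frames = [Image.open(p) for p in ["frame_000.jpg", "frame_001.jpg"]]

inputs = processor(text=class_prompts, images=frames,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Mean-pool frame embeddings into one video embedding, then score it
# against each class-prompt embedding with cosine similarity.
video_emb = out.image_embeds.mean(dim=0, keepdim=True)
video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
scores = video_emb @ text_emb.t()
print(class_prompts[scores.argmax(dim=-1).item()])
```

The zero-shot step is just nearest-neighbour retrieval in the shared embedding space, which is why richer natural-language supervision at training time can improve recognition of emotion classes never seen as labels.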
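The abstract reports Weighted Average Recall (WAR) and Unweighted Average Recall (UAR). As a reminder of how the two differ on imbalanced data, this small helper follows the standard definitions (it is not code from the paper): WAR weights per-class recall by class frequency and so equals overall accuracy, while UAR averages per-class recalls equally.

```python
# WAR vs. UAR on an imbalanced toy example; standard metric definitions.
import numpy as np

def war_uar(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    war = float((y_true == y_pred).mean())  # recall weighted by class support
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    uar = float(np.mean(recalls))           # every class counts equally
    return war, uar

# A majority-class predictor looks strong on WAR but weak on UAR.
y_true = np.array([0, 0, 0, 0, 1])
y_pred = np.array([0, 0, 0, 0, 0])
print(war_uar(y_true, y_pred))  # (0.8, 0.5)
```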