MMToM-QA: Multimodal Theory of Mind Question Answering
CoRR(2024)
摘要
Theory of Mind (ToM), the ability to understand people's minds, is an
essential ingredient for developing machines with human-level social
intelligence. Recent machine learning models, particularly large language
models, seem to show some aspects of ToM understanding. However, existing ToM
benchmarks use unimodal datasets - either video or text. Human ToM, on the
other hand, is more than video or text understanding. People can flexibly
reason about another person's mind based on conceptual representations (e.g.,
goals, beliefs, plans) extracted from any available data, which can include
visual cues, linguistic narratives, or both. To address this, we introduce a
multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA
comprehensively evaluates machine ToM both on multimodal data and on different
kinds of unimodal data about a person's activity in a household environment. To
engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian
Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified
representations from multimodal data and utilizes language models for scalable
Bayesian inverse planning. We conducted a systematic comparison of human
performance, BIP-ALM, and state-of-the-art models, including GPT-4. The
experiments demonstrate that large language models and large multimodal models
still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising
results, by leveraging the power of both model-based mental inference and
language models.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要