Temporal-Spatial Object Relations Modeling for Vision-and-Language Navigation
arXiv (2024)
Abstract
Vision-and-Language Navigation (VLN) is a challenging task in which an agent
must navigate to a location described in natural language using visual
observations. The agent's navigation ability can be enhanced by modeling the
relations between objects, which are usually learned from internal objects or
external datasets. Traditional studies model the relationships between internal
objects with a graph convolutional network (GCN). However,
GCNs tend to be shallow, which limits their modeling ability. To address this issue,
we use a cross-attention mechanism to learn the connections between objects
over a trajectory, which takes temporal continuity into account; we term this
Temporal Object Relations (TOR). External datasets have a gap with the
navigation environment, which leads to inaccurate modeling of relations. To avoid
this problem, we construct object connections from observations at all
viewpoints in the navigation environment, which ensures complete spatial
coverage and eliminates the gap; we call this Spatial Object Relations (SOR).
Additionally, we observe that agents may repeatedly visit the same location
during navigation, which significantly hinders their performance. To resolve
this, we introduce the Turning Back Penalty (TBP) loss function, which
penalizes the agent's repetitive visiting behavior and substantially reduces the
navigation distance. Experimental results on the REVERIE, SOON, and R2R
datasets demonstrate the effectiveness of the proposed method.
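
The abstract does not give the form of the TBP loss, so the following is only an illustrative sketch of one way a revisit penalty could be computed, not the authors' formulation. It assumes the agent outputs a probability for each candidate viewpoint at a step, and it penalizes the probability mass placed on viewpoints already visited. The function name `revisit_penalty`, the `weight` parameter, and the viewpoint labels are all hypothetical.

```python
from typing import Dict, Hashable, Iterable


def revisit_penalty(
    action_probs: Dict[Hashable, float],
    visited: Iterable[Hashable],
    weight: float = 1.0,
) -> float:
    """Return weight * total probability assigned to already-visited viewpoints.

    Assumption: this is a stand-in for a turning-back penalty, not the
    formulation used in the paper.
    """
    visited_set = set(visited)
    # Sum the probability of every candidate the agent has been to before.
    mass = sum(p for node, p in action_probs.items() if node in visited_set)
    return weight * mass


# Hypothetical step: the agent has visited A and B and now scores three candidates.
probs = {"A": 0.5, "C": 0.25, "D": 0.25}
penalty = revisit_penalty(probs, visited=["A", "B"])
print(penalty)  # 0.5
```

In a training loop, a term like this would presumably be added to the main navigation loss, so that moves back to visited viewpoints become more costly as their predicted probability grows.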