A Multiscale Grouping Transformer with CLIP Latents for Remote Sensing Image Captioning

Lingwu Meng, Jing Wang, Ran Meng, Yang Yang, Liang Xiao

IEEE Transactions on Geoscience and Remote Sensing (2024)

Abstract
Recent progress has shown that integrating multiscale visual features with advanced Transformer architectures is a promising approach to remote sensing image captioning (RSIC). However, self-attention lacks local modeling ability, which can lead to inaccurate contextual information. Moreover, the scarcity of image-caption pairs available for training makes it difficult to exploit the semantic alignment between images and texts. To mitigate these issues, we propose a Multiscale Grouping Transformer with Contrastive Language-Image Pre-training (CLIP) latents (MG-Transformer) for RSIC. First, a CLIP image embedding and a set of region features are extracted within a Multi-level Feature Extraction module. To achieve a comprehensive image representation, a Semantic Correlation module integrates the image embedding and the region features through an attention gate. The integrated image features are then fed into a Transformer model. The Transformer encoder uses dilated convolutions with different dilation rates to obtain multiscale visual features. To enhance the local modeling ability of the self-attention mechanism in the encoder, we introduce a Global Grouping Attention mechanism, which incorporates a grouping operation into self-attention so that each attention head can focus on different contextual information. The Transformer decoder then adopts the Meshed Cross-Attention mechanism to establish relationships between the visual features at various scales and the text features, from which it generates the image caption. Experimental results on three RSIC datasets demonstrate the superiority of the proposed MG-Transformer. The code will be publicly available at https://github.com/One-paper-luck/MG-Transformer.
Keywords
Remote sensing image captioning, Transformer, CLIP, multiscale, Grouping
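
The abstract states that the encoder obtains multiscale visual features from dilated convolutions with different dilation rates. As a rough illustration of that general idea only, the following sketch applies the same 3-tap kernel to a 1D signal at several dilation rates and reports the resulting receptive fields. It is not the authors' implementation: the 1D setting, the kernel, the dilation rates (1, 2, 3), the zero padding, and the function names `dilated_conv1d`, `receptive_field`, and `multiscale_features` are all assumptions made for illustration, since the abstract does not specify them.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D dilated convolution (cross-correlation) with zero padding.

    Assumption: an odd-sized kernel, so the padding is symmetric.
    """
    k = len(kernel)
    span = dilation * (k - 1)          # distance between first and last tap
    pad_left = span // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (span - pad_left)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def multiscale_features(signal, kernel, dilations=(1, 2, 3)):
    """One feature sequence per dilation rate (hypothetical rates)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    impulse = [0, 0, 0, 1, 0, 0, 0]
    kernel = [1, 1, 1]
    for d, feat in multiscale_features(impulse, kernel).items():
        print(f"dilation={d} receptive_field={receptive_field(len(kernel), d)} {feat}")
```

With a fixed kernel size, a larger dilation rate spreads the taps further apart, so each output position sees a wider context at no extra parameter cost. In a 2D setting the same effect is typically obtained by running parallel convolution branches with different dilation rates over the feature map.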