A Dual-Feature-Based Adaptive Shared Transformer Network for Image Captioning

Yinbin Shi, Ji Xia, MengChu Zhou, Zhengcai Cao

IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT (2024)

Abstract
Current models exhibit notable efficacy in image-captioning tasks. Mainstream research shows that combining dual visual features enhances visual representations and brings a performance boost. However, incorporating dual visual features complicates computation and increases the parameter count, hindering streamlined model deployment. Moreover, selecting region features requires a pretrained object detector, which limits the model's ease of use on new scenarios and data. In this article, we propose a dual-feature adaptive shared transformer network that capitalizes on the merits of grid and shallow patch features while circumventing the extra complexity of dual channels. Specifically, we eschew complex features such as region features to facilitate straightforward dataset compilation and expedite inference. We propose an adaptive shared transformer block (AST) to conserve parameters and reduce the model's FLOPs. A gating mechanism adaptively computes the importance of each feature, thereby yielding stronger visual features. Since flattening grid features before a transformer often discards crucial spatial information, we incorporate the learning of relative geometric information based on grid features into our proposed method. Our analysis of various feature-fusion techniques reveals that the AST approach outperforms its counterparts in FLOPs and model size while still achieving high performance. Extensive experiments on different datasets indicate that our model demonstrates competitive performance on MSCOCO and outperforms state-of-the-art models on small-scale datasets.
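
The abstract describes the gating mechanism only at a high level. Below is a minimal PyTorch sketch of one plausible gated fusion of the two feature streams, assuming both the grid features and the shallow patch features have already been projected to a common token count and channel dimension; the class name GatedFeatureFusion, the sigmoid gate, and all shapes are illustrative assumptions, not the paper's actual AST design.

    import torch
    import torch.nn as nn

    class GatedFeatureFusion(nn.Module):
        # Hypothetical gated fusion of two visual feature streams; the
        # paper's AST block layout and gating details may differ.
        def __init__(self, dim: int):
            super().__init__()
            # Gate scores are computed jointly from both streams.
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, grid_feats: torch.Tensor,
                    patch_feats: torch.Tensor) -> torch.Tensor:
            # grid_feats, patch_feats: (batch, num_tokens, dim)
            g = self.gate(torch.cat([grid_feats, patch_feats], dim=-1))
            # Convex combination: the gate adaptively weights each feature.
            return g * grid_feats + (1 - g) * patch_feats

    # Usage example (all shapes are assumptions):
    fusion = GatedFeatureFusion(dim=512)
    grid = torch.randn(2, 49, 512)   # e.g., 7x7 grid features
    patch = torch.randn(2, 49, 512)  # shallow patch features, same token count
    fused = fusion(grid, patch)      # (2, 49, 512)

Because the gate is computed per token and per channel, this kind of fusion adds only a single linear layer on top of the two streams, which is consistent with the abstract's emphasis on keeping parameters and FLOPs low.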
Keywords
Deep learning, image captioning, transformer