Abstract
Remote sensing image captioning (RSIC) aims to generate informative, detailed textual descriptions of satellite and aerial images. Traditional methods struggle with this task because variations in scale, viewpoint, and scene complexity leave them with limited contextual awareness. In this paper, we propose the Multiscale Region-Aware Captioning Network (MSR-CapNet), which generates relevant and semantically correct descriptions of scenes in satellite and aerial images. MSR-CapNet integrates three components: Feature Pyramid Encoding, which represents local and global visual characteristics; Adaptive Attention, which dynamically prioritizes relevant regions; and Topic-Sensitive Embeddings, which promote semantically consistent captions. We train and test the method on the RSICD and UCM caption datasets. Compared with recent transformer-based and graph-based baselines, MSR-CapNet shows consistent improvements on the BLEU-4, METEOR, and CIDEr metrics.
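As an illustration only, the idea of attending over region features drawn from several pyramid levels can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the toy two-dimensional feature vectors, and the dot-product scoring are hypothetical and are not taken from the MSR-CapNet implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Soft attention: weight each region by its dot-product score with the query.

    Returns the attention weights and the weighted-sum context vector.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * feat[i] for w, feat in zip(weights, region_features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy features, one region per pyramid level (assumed values).
coarse_level = [[1.0, 0.0]]  # global context
middle_level = [[0.0, 1.0]]
fine_level = [[0.5, 0.5]]    # local detail

# Regions from all scales are pooled into one set before attention.
regions = coarse_level + middle_level + fine_level
query = [2.0, 0.0]           # stand-in for a decoder state (assumed)

weights, context = attend(regions, query)
print(weights)
print(context)
```

Here the query aligns most strongly with the coarse-level region, so that region receives the largest weight. In a trained model the query and the scoring function would be learned rather than fixed.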
References
Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." In International conference on machine learning, PMLR, 2015, 2048-2057.
Liu, Chenyang, Rui Zhao, Hao Chen, Zhengxia Zou, and Zhenwei Shi. "Remote Sensing Image Change Captioning with Dual-Branch Transformers: A New Method and A Large Scale Dataset." IEEE Transactions on Geoscience and Remote Sensing 60 (2022): 1-20.
Zou, Shiwei, Yingmei Wei, Yuxiang Xie, and Xidao Luan. "Frequency–Spatial–Temporal Domain Fusion Network for Remote Sensing Image Change Captioning." Remote Sensing 17, no. 8 (2025): 1463.
Anderson, Peter, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. "Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 6077-6086.
Lu, Jiasen, Caiming Xiong, Devi Parikh, and Richard Socher. "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 375-383.
Cheng, Kangda, Jinlong Liu, Rui Mao, Zhilu Wu, and Erik Cambria. "CSA-RSIC: Cross-Modal Semantic Alignment for Remote Sensing Image Captioning." IEEE Geoscience and Remote Sensing Letters (2025).
Lin, Tsung-Yi, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. "Feature Pyramid Networks for Object Detection." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 2117-2125.
Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows." In Proceedings of the IEEE/CVF international conference on computer vision, 2021, 10012-10022.
Wang, Wenhai, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions." In Proceedings of the IEEE/CVF international conference on computer vision, 2021, 568-578.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is All You Need." Advances in neural information processing systems 30 (2017).
Gururangan, Suchin, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." arXiv preprint arXiv:2004.10964 (2020).
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep Residual Learning for Image Recognition." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, 770-778.
Xu, Zhiyong, Weicun Zhang, Tianxiang Zhang, Zhifang Yang, and Jiangyun Li. "Efficient Transformer for Remote Sensing Image Segmentation." Remote Sensing 13, no. 18 (2021): 3585.
Li, Hanqian, Ruinan Zhang, Ye Pan, Junchi Ren, and Fei Shen. "LR-FPN: Enhancing Remote Sensing Object Detection with Location Refined Feature Pyramid Network." In 2024 International Joint Conference on Neural Networks (IJCNN), IEEE, 2024, 1-8.
Chen, Long, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. "SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 5659-5667.
Huang, Wei, Qi Wang, and Xuelong Li. "Denoising-based Multiscale Feature Fusion for Remote Sensing Image Captioning." IEEE Geoscience and Remote Sensing Letters 18, no. 3 (2020): 436-440.
Liu, Chenyang, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, and Zhenwei Shi. "Remote Sensing Spatiotemporal Vision–Language Models: A Comprehensive Survey." IEEE Geoscience and Remote Sensing Magazine (2025).
Zhou, Luowei, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. "Unified Vision-Language Pre-Training for Image Captioning and VQA." In Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, 13041-13049.
Zhao, Beigeng. "A Systematic Survey of Remote Sensing Image Captioning." IEEE Access 9 (2021): 154086-154111.
Chen, Jie, Xinyi Dai, Ya Guo, Jingru Zhu, Xiaoming Mei, Min Deng, and Geng Sun. "Urban Built Environment Assessment based on Scene Understanding of High-Resolution Remote Sensing Imagery." Remote Sensing 15, no. 5 (2023): 1436.
Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, 311-318.
Banerjee, Satanjeev, and Alon Lavie. "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments." In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, 65-72.
Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. "CIDEr: Consensus-based Image Description Evaluation." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, 4566-4575.
Cheng, Gong, Junwei Han, and Xiaoqiang Lu. "Remote Sensing Image Scene Classification: Benchmark and State of the Art." Proceedings of the IEEE 105, no. 10 (2017): 1865-1883.
Rennie, Steven J., Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. "Self-critical Sequence Training for Image Captioning." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 7008-7024.
Zeng, Yan, Xinsong Zhang, Hang Li, Jiawei Wang, Jipeng Zhang, and Wangchunshu Zhou. "X²-VLM: All-in-One Pre-Trained Model for Vision-Language Tasks." IEEE transactions on pattern analysis and machine intelligence 46, no. 5 (2023): 3156-3168.
Li, Chenliang, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye et al. "mPLUG: Effective and Efficient Vision-Language Learning by Cross-Modal Skip-Connections." In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, 7241-7259.
Wang, Peng, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework." In International conference on machine learning, PMLR, 2022, 23318-23340.
Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. "BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation." In International conference on machine learning, PMLR, 2022, 12888-12900.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. "Language Models are Few-Shot Learners." Advances in neural information processing systems 33 (2020): 1877-1901.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, 4171-4186.
Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint arXiv:2302.13971 (2023).
Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. "Meshed-Memory Transformer for Image Captioning." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, 10578-10587.
Wang, Yuduo, Weikang Yu, and Pedram Ghamisi. "Change Captioning in Remote Sensing: Evolution to SAT-Cap--A Single-Stage Transformer Approach." arXiv preprint arXiv:2501.08114 (2025).
Yang, Cong, Zuchao Li, and Lefei Zhang. "Bootstrapping Interactive Image–Text Alignment for Remote Sensing Image Captioning." IEEE Transactions on Geoscience and Remote Sensing 62 (2024): 1-12.
