Hierarchical Sparse Vision Transformers for Real-Time Drone-Based Object Detection

Vivekanandam B.

doi:10.36548/jaicn.2026.2.006

Open Access

https://doi.org/10.36548/jaicn.2026.2.006

Vol. 8, No. 2 (2026)

Published: 26 May, 2026

Pages: 147-165

Vivekanandam B.

Associate Professor, School of AI Computing and Multimedia, Lincoln University College, Malaysia

How to Cite

B., Vivekanandam. 2026. “Hierarchical Sparse Vision Transformers for Real-Time Drone-Based Object Detection”. Journal of Artificial Intelligence and Capsule Networks 8 (2): 147-65. https://doi.org/10.36548/jaicn.2026.2.006.

Keywords

Scalable Attention

Vision Transformer

Sparse Self-Attention

Aerial Object Detection

Multi-Scale Fusion

Real-Time Detection

Abstract

Transformer-based object detectors offer strong contextual modeling, but the quadratic complexity of self-attention has restricted their use in real-time aerial settings. This paper proposes a Scalable Adaptive Hierarchical Attention Transformer (SAHAT-Det) for efficient object detection in drone imagery. The framework introduces dynamic relevance-based token scoring, top-K sparse attention computation, and adaptive token pruning to lower the computational cost. A multi-scale hierarchical fusion module preserves fine-grained spatial detail, particularly for small and distant objects. On the VisDrone dataset, experimental results show higher mAP@0.5:0.95 and higher small-object detection accuracy than recent CNN- and transformer-based baselines run under the same configuration. With reduced attention computation, the proposed model maintains near real-time inference speed. Qualitative analysis also shows improved localization stability in dense urban scenes. These results indicate that adaptive sparse attention offers an efficient trade-off between detection accuracy and processing cost in real-time aerial object detection.
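The token pruning and top-K sparse attention named in the abstract can be illustrated with a minimal sketch. It is not the SAHAT-Det implementation: the abstract does not specify the scoring function, so the L2 norm of each token is used here as an assumed stand-in for the relevance score, and the names prune_tokens, topk_sparse_attention, keep_ratio, and k are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prune_tokens(tokens, keep_ratio):
    """Return indices of the highest-scoring tokens, in original order.

    Assumption: relevance is approximated by the L2 norm of each token.
    """
    scores = [math.sqrt(dot(t, t)) for t in tokens]
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def topk_sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query keeps only its k best keys."""
    d = len(keys[0])
    k = min(k, len(keys))
    outputs = []
    for q in queries:
        logits = [dot(q, key) / math.sqrt(d) for key in keys]
        # Keep the k keys with the largest logits; all others get zero weight.
        top = sorted(range(len(keys)), key=lambda j: logits[j], reverse=True)[:k]
        weights = softmax([logits[j] for j in top])
        out = [
            sum(w * values[j][c] for w, j in zip(weights, top))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy input: four 2-dimensional tokens.
tokens = [[1.0, 0.0], [0.0, 0.1], [0.5, 0.5], [0.0, 2.0]]
kept = prune_tokens(tokens, keep_ratio=0.5)   # indices of the retained tokens
x = [tokens[i] for i in kept]
out = topk_sparse_attention(x, x, x, k=1)     # each token attends to one key
print(kept, out)
```

This toy version still scores every query-key pair before selecting the top K, so any saving comes from pruning tokens first and from aggregating only K values per query; an efficient implementation would avoid materializing the full score matrix.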

References

Shehzadi, Tahira, Khurram Azeem Hashmi, Marcus Liwicki, Didier Stricker, and Muhammad Zeshan Afzal. “Object Detection with Transformers: A Review.” Sensors 2025, vol. 25, no. 19: 6025.
Li, Yong, Naipeng Miao, Liangdi Ma, Feng Shuang, and Xingwen Huang. “Transformer for Object Detection: Review and Benchmark.” Engineering Applications of Artificial Intelligence 2023, vol. 126: 107021.
Qi, Shuaihui, Xiaofeng Song, Tongfei Shang, Xiaochang Hu, and Kun Han. “MSFE-YOLO: An Improved YOLOv8 Network for Object Detection on Drone View.” IEEE Geoscience and Remote Sensing Letters 2024, vol. 21, 1-5.
Abu-Khadrah, Ahmed, Ahmad Al-Qerem, Mohammad R. Hassan, Ali Mohd Ali, and Muath Jarrah. “Drone-Assisted Adaptive Object Detection and Privacy-Preserving Surveillance in Smart Cities Using Whale-Optimized Deep Reinforcement Learning Techniques.” Scientific Reports 2025, vol. 15, no. 1: 9931.
Ye, Yanming, Qiang Sun, Kailong Cheng, Xingfa Shen, and Dongjing Wang. “A Lightweight Mechanism for Vision-Transformer-Based Object Detection.” Complex & Intelligent Systems 2025, vol. 11, no. 7: 302.
Lou, Haitong, Xuehu Duan, Junmei Guo, Haiying Liu, Jason Gu, Lingyun Bi, and Haonan Chen. “DC-YOLOv8: Small-Size Object Detection Algorithm Based on Camera Sensor.” Electronics 2023, vol. 12, no. 10: 2323.
Wang, Yong, Qian Wang, Rui Zou, Feng Wen, Feng Liu, Yang Zhang, Shuang Du, and Wei Zeng. “Advancing Image Object Detection: Enhanced Feature Pyramid Network and Gradient Density Loss for Improved Performance.” Applied Sciences 2023, vol. 13, no. 22: 12174.
Aboghanem, Ahmed, Mohamed Abd Elfattah, H. M. Amer, and A. T. Khalil. “A Hybrid ResNet50-Vision Transformer Model with an Attention Mechanism for Aerial Image Classification.” Scientific Reports 2026, vol. 16: 5940.
Khan, A., Z. Rauf, A. Sohail, A. R. Khan, H. Asif, A. Asif, and U. Farooq. “A Survey of the Vision Transformers and Their CNN-Transformer Based Variants.” Artificial Intelligence Review 2023, vol. 56, no. S3, 2917–2970.
Wang, Yong, Jun Zhang, and Jian Zhou. “Urban Traffic Tiny Object Detection via Attention and Multi-Scale Feature Driven in UAV-Vision.” Scientific Reports 2024, vol. 14: 20614.
Cao, Jian, Bin Peng, Ming Gao, Hao Hao, Li Li, Xiang Li, and Hui Mou. “Object Detection Based on CNN and Vision-Transformer: A Survey.” IET Computer Vision 2025, vol. 19, no. 1: e70028.
Zhang, Wei, and Ying Yang. “FoT: An Efficient Transformer Framework for Real-Time Small Object Detection in Football Videos.” Scientific Reports 2025, vol. 15: 30875.
Khan, Salman, Muhammad Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. “Transformers in Vision: A Survey.” ACM Computing Surveys 2022, vol. 54, no. 10s: 200.
Hassija, Vikas, Bhuvaneswari Palanisamy, Arindam Chatterjee, Anirban Mandal, Debashis Chakraborty, Ankit Pandey, and Deepak Kumar. “Transformers for Vision: A Survey on Innovative Methods for Computer Vision.” IEEE Access 2025, vol. 13: 3571735.
Jamil, Saad, Mohammad Jalil Piran, and Oh-Joon Kwon. “A Comprehensive Survey of Transformers for Computer Vision.” Drones 2023, vol. 7, no. 5: 287.
Wang, Yong, Yifan Deng, Yuxin Zheng, Prithwijit Chattopadhyay, and Lei Wang. “Vision Transformers for Image Classification: A Comparative Survey.” Technologies 2025, vol. 13, no. 1: 32.
Hua, Wenqi, Qing Chen, and Wei Chen. “A New Lightweight Network for Efficient UAV Object Detection.” Scientific Reports 2024, vol. 14: 13288.
Yuan, Yifan, Yu Wu, Liang Zhao, Hong Chen, and Ying Zhang. “Multiple Object Detection and Tracking from Drone Videos Based on GM-YOLO and Multitracker.” Image and Vision Computing 2024, vol. 143: 104951.
Chen, Yu, Xiaobo Gu, Zhen Liu, and Jun Liang. “A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method.” Remote Sensing 2022, vol. 14, no. 8: 1877.
Jamil, Saad, Mohammad S. Abbas, and A. M. Roy. “Distinguishing Malicious Drones Using Vision Transformer.” AI 2022, vol. 3, no. 2, 260–273.
Khoshnevis, S. A., and A. Amirkhani. “Tracking with Attention: A Review of Transformer-Based Object Tracking.” Engineering Science and Technology, an International Journal 2026, vol. 73: 102263.
Lai, Nicholas, D. A. Dewi, and S. S. Maidin. “Integrating Attention Mechanisms in Multi-Scale Image Detection: A Bibliometric Analysis of Research Evolution and Frontier Trends.” SICE Journal of Control, Measurement, and System Integration 2025, vol. 18, no. 1: 2567085.
Junos, M. H., and A. S. M. Khairuddin. “YOLO-MMS for Aerial Object Detection Model Based on Hybrid Feature Extractor and Improved Multi-Scale Prediction.” The Visual Computer 2025, vol. 41, 4759–4778.
VisDrone Dataset - https://www.kaggle.com/datasets/kushagrapandya/visdrone-dataset
