Gesture Command Recognition Using Multi-Modal Attention Fusion from RGB and Thermal Image Streams

Keywords

Multimodal Gesture Recognition
Guided Attention Mechanism
Thermal-RGB Fusion
Temporal Sequence Modeling
Dual-Encoder Architecture
Modality Confidence Estimation

How to Cite

B., Padmavathi, Aarthi Elaveini M., Kapileswar N., Judy Simon, and Reshma P Vengaloor. 2025. “Gesture Command Recognition Using Multi-Modal Attention Fusion from RGB and Thermal Image Streams”. Journal of Innovative Image Processing 7 (2): 388-419. https://doi.org/10.36548/jiip.2025.2.007.

Abstract

Gesture recognition serves as a vital interface in human-machine communication, enabling systems to interpret and respond to user commands through natural body movements, particularly hand gestures. In the development of smart environments, assistive systems, and augmented reality applications, accurate and real-time gesture interpretation is essential. However, gesture recognition faces several challenges, including variations in lighting, background complexity, hand occlusions, and the temporal dynamics of human gestures. Existing approaches depend primarily on RGB data, making them susceptible to environmental noise and fluctuations in illumination. Additionally, some existing methods model temporal dependencies poorly, which reduces recognition reliability. To address these limitations, this research proposes a novel architecture, DMT-GAFNet, designed to enhance gesture command recognition by integrating dual-modality encoding with a guided attention fusion model. The model incorporates parallel encoders for RGB and thermal streams, alongside a modality confidence estimator that dynamically weights features based on input reliability. A lightweight GRU-based temporal encoder ensures effective sequential modeling of gestures. The system was experimentally validated on a dataset combining HaGRID RGB data and Zenodo thermal data, encompassing six gesture classes and diverse visual conditions. Comparative analysis with existing deep learning models, including CNN-LSTM, MobileNetV2, ResNet18, EfficientNetB0, and VGG16, demonstrates that the proposed model outperforms these alternatives, achieving a precision of 0.9399, recall of 0.9484, F1-score of 0.9493, specificity of 0.9523, and accuracy of 97.05%. The proposed method not only achieves high classification accuracy under varying conditions but also exhibits significant potential for deployment in real-time gesture-based interaction systems.
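
The confidence-weighted fusion idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' DMT-GAFNet implementation: the function names `softmax` and `fuse_frame`, the use of a softmax to turn two scalar reliability scores into weights, and the toy feature vectors are all assumptions made for illustration. The abstract states only that a modality confidence estimator weights the RGB and thermal features according to input reliability.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_frame(
    rgb_feat: Sequence[float],
    thermal_feat: Sequence[float],
    rgb_score: float,
    thermal_score: float,
) -> List[float]:
    """Blend two per-frame feature vectors of equal length.

    rgb_score and thermal_score are hypothetical reliability scores;
    in a trained model they would come from a learned estimator.
    """
    if len(rgb_feat) != len(thermal_feat):
        raise ValueError("feature vectors must have the same length")
    w_rgb, w_thermal = softmax([rgb_score, thermal_score])
    return [w_rgb * r + w_thermal * t for r, t in zip(rgb_feat, thermal_feat)]


if __name__ == "__main__":
    rgb = [1.0, 0.0]      # toy RGB encoder output (assumed)
    thermal = [0.0, 1.0]  # toy thermal encoder output (assumed)

    # Equal reliability: both modalities contribute equally.
    print(fuse_frame(rgb, thermal, 0.0, 0.0))

    # Poorly lit scene (assumed low RGB score): thermal dominates.
    print(fuse_frame(rgb, thermal, -10.0, 10.0))
```

In a full pipeline, the fused vector for each frame would then be passed in order to a temporal encoder such as the GRU mentioned in the abstract; that step is omitted here to keep the sketch short.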
