Cross Attention Based Feature Fusion Network for Robust Anomaly Detection in Surveillance Videos

Dipak Ramoliya; Amit Ganatra

doi:10.36548/jiip.2025.3.006

Open Access

Vol. 7, No. 3 (2025)

Published: 29 August, 2025

Pages: 679-694

Dipak Ramoliya

Department of Computer Science & Engineering, Devang Patel Institute of Advance Technology and Research (DEPSTAR), Faculty of Technology and Engineering (FTE), Charotar University of Science and Technology (CHARUSAT), Changa, Anand

Amit Ganatra

Department of Computer Science & Engineering, Devang Patel Institute of Advance Technology and Research (DEPSTAR), Faculty of Technology and Engineering (FTE), Charotar University of Science and Technology (CHARUSAT), Changa, Anand; Department of Computer Science and Engineering, Faculty of Engineering and Technology, Parul University (PU), Waghodia, Vadodara

How to Cite

Ramoliya, Dipak, and Amit Ganatra. 2025. “Cross Attention Based Feature Fusion Network for Robust Anomaly Detection in Surveillance Videos”. Journal of Innovative Image Processing 7 (3): 679-94. https://doi.org/10.36548/jiip.2025.3.006.

Keywords

Anomaly Detection

Computer Vision

Video Surveillance

Multimodal Learning

Attention Network

Feature Fusion

Abstract

Surveillance systems are essential for public safety, and video surveillance is the most widely used means of maintaining safety in public and private areas. Detecting and recognizing abnormal activity remains difficult because of complex environments, variable video quality, and varying noise levels. To address these challenges in accuracy and video processing, this study uses a cross-attention network with feature fusion to improve the recognition of abnormal activity in complex scenarios. Cross-attention helps capture contextual information from different videos. The proposed model combines cross-attention and feed-forward attention with fusion based on a latent-space representation, with the aim of improving accuracy. Experiments on two benchmark datasets, UCF and UCSD, achieve accuracies of 97.1% and 91.31%, respectively. The study also presents a comparative analysis against different convolutional and attention networks for anomaly detection. The results suggest an effective video processing scheme with wide practical potential, and the study offers a new perspective and methodological basis for future research and applications in related fields.
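To make the idea of cross-attention followed by feature fusion concrete, the following is a minimal sketch in plain Python. It is not the authors' architecture: the abstract does not specify projections, layer sizes, or the fusion operator, so this sketch assumes identity query/key/value projections, mean pooling over time, and fusion by concatenation. The names `cross_attention`, `mean_pool`, and `fuse`, and the two toy feature streams, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product cross-attention.

    Each query vector attends over the vectors of the *other* stream.
    Assumption: identity projections, so keys and values are the same vectors.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def mean_pool(rows):
    """Average a sequence of vectors into a single vector."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def fuse(stream_a, stream_b):
    """Cross-attend in both directions, pool each result, and concatenate."""
    a_context = cross_attention(stream_a, stream_b)
    b_context = cross_attention(stream_b, stream_a)
    return mean_pool(a_context) + mean_pool(b_context)


# Toy per-frame feature vectors for two hypothetical streams.
stream_a = [[1.0, 0.0], [0.0, 1.0]]
stream_b = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(stream_a, stream_b)
print(len(fused), fused)
```

In a trained model the projections would be learned and the fused vector would feed a classifier. Concatenation is used here only because it is the simplest fusion that keeps both contexts visible.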

