Improving Temporal Localization in Vision-Language Video Anomaly Detection

Siddharth Shah; Narendrasinh Chauhan

doi:10.36548/jiip.2026.3.002

Open Access

https://doi.org/10.36548/jiip.2026.3.002

Vol. 8, No. 3 (2026)

Published: 05 June, 2026

Pages: 765-785

Siddharth Shah

Research Scholar, The Charutar Vidya Mandal (CVM) University, Anand, Gujarat, India

Narendrasinh Chauhan

Department of Information Technology, A. D. Patel Institute of Technology, The Charutar Vidya Mandal (CVM) University, Anand, Gujarat, India

How to Cite

Shah, Siddharth, and Narendrasinh Chauhan. 2026. “Improving Temporal Localization in Vision-Language Video Anomaly Detection”. Journal of Innovative Image Processing 8 (3): 765-85. https://doi.org/10.36548/jiip.2026.3.002.

Keywords

Video Anomaly Detection

Weakly Supervised Learning

Vision-Language Models

Temporal Localization

Abstract

Recent vision-language frameworks for weakly supervised video anomaly detection (WSVAD) have significantly improved detection performance. Dual-branch designs, in which one branch performs binary classification and the other aligns textual descriptors with visual snippets, have proven effective for anomaly detection. Nonetheless, temporal localization remains a significant problem. Existing solutions select a fixed number of top-k snippets regardless of whether an anomaly is short or long. Moreover, predictions are temporally inconsistent, with spikes during normal periods and gaps within anomalous segments. The proposed approach builds on the VadCLIP framework and modifies only specific components of it. First, Confidence-Adaptive Multiple Instance Learning (CA-MIL) computes a per-video threshold from the score distribution, selecting fewer snippets when confidence is low and more when an anomalous event spans a longer time frame. Second, a temporal smoothness term penalizes abrupt score transitions between adjacent snippets. Third, two parallel scoring heads, a point-wise MLP and a local-context convolution, are fused through learned gating that accounts for their disagreement. Lastly, at test time, Score-level Temporal Context Aggregation (STCA) smooths the final predictions using local averaging and global statistics. Cross-modal attention provides a small additional boost to AUC. On UCF-Crime, the average mAP over IoU thresholds 0.1–0.5 increases from 6.68 to 9.37 (+40.3%), with mAP@0.5. On XD-Violence, the average mAP increases from 24.70 to 31.63 (+28.1%). Detection performance is preserved: UCF-Crime AUC decreases by 0.10 percentage points, from 88.02 to 87.92, and XD-Violence AP increases by 0.22 percentage points, from 84.51 to 84.73.
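
The abstract gives no formulas, so the following is only an illustrative sketch of three of the ideas it names: adaptive snippet selection, the temporal smoothness term, and test-time score smoothing. Everything in it is an assumption made for illustration and is not the authors' implementation: the function names, the threshold rule (mean plus `alpha` times the standard deviation), the squared-difference penalty, and the parameters `alpha`, `window`, and `global_weight`.

```python
from statistics import mean, pstdev


def adaptive_select(scores, alpha=0.5):
    """Select snippets above a per-video threshold.

    Assumed rule (not from the paper): threshold = mean + alpha * std.
    """
    threshold = mean(scores) + alpha * pstdev(scores)
    selected = [i for i, s in enumerate(scores) if s > threshold]
    if not selected:
        # Fall back to the single highest-scoring snippet so the bag is never empty.
        selected = [max(range(len(scores)), key=scores.__getitem__)]
    return selected


def smoothness_penalty(scores):
    """Mean squared difference between adjacent snippet scores (assumed form)."""
    if len(scores) < 2:
        return 0.0
    diffs = [(b - a) ** 2 for a, b in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


def smooth_scores(scores, window=3, global_weight=0.1):
    """Blend a local moving average with the video-level mean (assumed form)."""
    half = window // 2
    global_mean = mean(scores)
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        local_mean = mean(scores[lo:hi])
        smoothed.append((1 - global_weight) * local_mean + global_weight * global_mean)
    return smoothed


# A short anomaly (one high-scoring snippet) and a long one (five snippets).
short_video = [0.1] * 9 + [0.9]
long_video = [0.1] * 5 + [0.9] * 5
short_selected = adaptive_select(short_video)
long_selected = adaptive_select(long_video)
```

Under these assumptions the number of selected snippets follows the score distribution of each video instead of a fixed k: the short anomaly yields one selected snippet and the long one yields five. The learned gated fusion of the two scoring heads is omitted here because it depends on trained parameters.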

