Multi-Modal Deepfake Detection via Spatial, Temporal, and Audio-Visual Fusion with Vision Transformers

Merlin Gethsy D.; Sharmila V.

doi:10.36548/jscp.2026.3.003

Open Access

Vol. 8, No. 3 (2026)

Published: 30 June, 2026

Pages: 219-238

Merlin Gethsy D.

Assistant Professor, Department of Computer Science and Engineering, V V College of Engineering, Tisaiyanvilai, Thoothukudi, India

Sharmila V.

PG Student, Department of Computer Science and Engineering, V V College of Engineering, Tisaiyanvilai, Thoothukudi, India

How to Cite

D., Merlin Gethsy, and Sharmila V. 2026. “Multi-Modal Deepfake Detection via Spatial, Temporal, and Audio-Visual Fusion With Vision Transformers”. Journal of Soft Computing Paradigm 8 (3): 219-38. https://doi.org/10.36548/jscp.2026.3.003.

Keywords

Deepfake Detection

Vision Transformer

Multi-Modal Fusion

Temporal Consistency Analysis

Audio-Visual Synchronization

Digital Media Forensics

Abstract

The rapid advancement of deepfake generation technologies has intensified concerns about digital misinformation, identity impersonation, and media manipulation. Although numerous deepfake detection methods have been developed to mitigate these threats, most rely on a single modality and show limited robustness when confronted with diverse manipulation techniques and cross-dataset scenarios. To overcome these deficiencies, we propose VeriSphere, a multimodal deepfake detection framework that combines spatial, temporal, and audio-visual forensics in one system. It uses a Vision Transformer to detect spatial artifacts, an X-CLIP-based module to capture temporal cues across frames, and an audio-visual synchronization module to check whether speech aligns with lip movements. The outputs of the three modules are then fused using a weighted strategy to produce a single trust score for prediction. Across three benchmark datasets (FaceForensics++, Celeb-DF, and DFDC), VeriSphere achieves an accuracy of 92.1%, an AUC of 0.963, and an F1-score of 0.924.
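
The weighted fusion step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion weights, the score ranges, or the decision threshold, so everything below is an assumption made for illustration: the names `ModalityScores`, `fuse_trust_score`, and `classify` are hypothetical, each module is assumed to output a probability in [0, 1] that the clip is authentic, and the weights (0.4, 0.3, 0.3) and threshold 0.5 are placeholders, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModalityScores:
    """Assumed per-module probabilities that a clip is authentic, each in [0, 1]."""
    spatial: float       # hypothetical output of the Vision Transformer branch
    temporal: float      # hypothetical output of the X-CLIP-based branch
    audio_visual: float  # hypothetical output of the lip-sync branch


def fuse_trust_score(scores, weights=(0.4, 0.3, 0.3)):
    """Return a weighted average of the three module scores.

    The default weights are illustrative placeholders, not the paper's values.
    Weights are normalized, so they do not need to sum to 1.
    """
    values = (scores.spatial, scores.temporal, scores.audio_visual)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each module score must lie in [0, 1]")
    if len(weights) != 3 or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be three non-negative numbers with a positive sum")
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def classify(trust_score, threshold=0.5):
    """Label a clip from its trust score; the threshold is an assumed placeholder."""
    return "real" if trust_score >= threshold else "fake"


if __name__ == "__main__":
    clip = ModalityScores(spatial=0.9, temporal=0.8, audio_visual=0.7)
    trust = fuse_trust_score(clip)
    print(f"trust={trust:.2f} -> {classify(trust)}")
```

Normalizing the weights keeps the trust score in [0, 1] regardless of how the weights are scaled, which makes a fixed threshold meaningful. In practice the weights would presumably be tuned on validation data rather than set by hand.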

