Improving Narrative Coherence in Dense Video Captioning through Transformer and Large Language Models
PDF

How to Cite

Bhatt, Dvijesh, and Priyank Thakkar. 2025. “Improving Narrative Coherence in Dense Video Captioning through Transformer and Large Language Models”. Journal of Innovative Image Processing 7 (2): 333-61. https://doi.org/10.36548/jiip.2025.2.005.

Keywords

  • Convolution 3D
  • Dense Video Caption
  • LSTM
  • Transformer
  • Encoder-Decoder
  • LLM

Abstract

Dense video captioning aims to identify events within a video and generate a natural language description for each event. Most existing approaches follow a two-stage framework consisting of an event proposal module and a caption generation module. Previous methods have predominantly employed convolutional neural networks and sequential models to describe each event in isolation. This isolation excludes the influence of neighboring events when generating a caption for a specific segment, often resulting in descriptions that lack coherence with the broader storyline of the video. To address this limitation, we propose a captioning module that leverages both a Transformer architecture and a Large Language Model (LLM). A convolutional and LSTM-based proposal module detects and localizes events within the video, and an encoder-decoder Transformer generates an initial caption for each proposed event. An LLM then takes the set of individually generated event captions as input and produces a coherent, multi-sentence summary that captures cross-event dependencies and provides a contextually unified, narratively rich description of the entire video. Extensive experiments on the ActivityNet dataset demonstrate that the proposed model, Transformer-LLM based Dense Video Captioning (TL-DVC), achieves a 9.22% improvement over state-of-the-art models, increasing the METEOR score from 11.28 to 12.32.
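To make the two-stage flow described above concrete, the following is a minimal Python sketch of the pipeline structure (event proposal → per-event captioning → LLM-style fusion into one narrative). Every function here is a hypothetical toy stand-in chosen for illustration, not the authors' implementation: the real proposal module uses 3D convolutions and an LSTM, the captioner is a Transformer encoder-decoder, and the fusion step is an LLM.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EventProposal:
    start: float  # seconds
    end: float    # seconds

def propose_events(duration: float, window: float = 10.0) -> List[EventProposal]:
    """Toy stand-in for the Conv3D+LSTM proposal module: fixed sliding windows."""
    t, proposals = 0.0, []
    while t < duration:
        proposals.append(EventProposal(t, min(t + window, duration)))
        t += window
    return proposals

def caption_event(p: EventProposal) -> str:
    """Toy stand-in for the Transformer encoder-decoder captioner."""
    return f"Event from {p.start:.0f}s to {p.end:.0f}s."

def summarize(captions: List[str]) -> str:
    """Toy stand-in for the LLM that fuses per-event captions into one story."""
    return " ".join(captions)

# Stage 1: localize candidate events; Stage 2: caption each, then fuse.
proposals = propose_events(duration=25.0)
captions = [caption_event(p) for p in proposals]
story = summarize(captions)
```

The point of the sketch is the data flow: each module consumes the previous module's output, and only the final fusion step sees all events at once, which is where cross-event coherence is imposed.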

References

Kim, Jinkyu, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. "Textual explanations for self-driving vehicles." In Proceedings of the European conference on computer vision (ECCV), 2018, 563-578.

Potapov, Danila, Matthijs Douze, Zaid Harchaoui, and Cordelia Schmid. "Category-specific video summarization." In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, Springer International Publishing, 2014, 540-555.

Dinh, Quang Minh, Minh Khoi Ho, Anh Quan Dang, and Hung Phong Tran. "Trafficvlm: A controllable visual language model for traffic video captioning." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, 7134-7143.

Yang, Haojin, and Christoph Meinel. "Content based lecture video retrieval using speech and video text information." IEEE transactions on learning technologies 7, no. 2 (2014): 142-154.

Anne Hendricks, Lisa, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing moments in video with natural language." In Proceedings of the IEEE international conference on computer vision, 2017, 5803-5812.

Aggarwal, Akshay, Aniruddha Chauhan, Deepika Kumar, Mamta Mittal, Sudipta Roy, and Tai-hoon Kim. "Video caption based searching using end-to-end dense captioning and sentence embeddings." Symmetry 12, no. 6 (2020): 992.

Wu, Shaomei, Jeffrey Wieland, Omid Farivar, and Julie Schiller. "Automatic alt-text: Computer-generated image descriptions for blind users on a social network service." In Proceedings of the 2017 ACM conference on computer supported cooperative work and social computing, 2017, 1180-1192.

Sai Jyothi, V. U. K. V. Y. S. D. R. B., and R. S. G. "Intelligent FAQ chatbot: A user-centric approach using large language models." Journal of Artificial Intelligence and Capsule Networks 7, no. 1 (2025): 78-93. https://doi.org/10.36548/jaicn.2025.1.006.

Shi, Botian, Lei Ji, Yaobo Liang, Nan Duan, Peng Chen, Zhendong Niu, and Ming Zhou. "Dense procedure captioning in narrated instructional videos." In Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, 6382-6391.

Chen, Yangyu, Shuhui Wang, Weigang Zhang, and Qingming Huang. "Less is more: Picking informative frames for video captioning." In Proceedings of the European conference on computer vision (ECCV), 2018, 358-373.

Suin, Maitreya, and A. N. Rajagopalan. "An efficient framework for dense video captioning." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, 12039-12046.

Escorcia, Victor, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem. "Daps: Deep action proposals for action understanding." In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer International Publishing, 2016, 768-784.

Buch, Shyamal, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. "Sst: Single-stream temporal action proposals." In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, 2911-2920.

Wang, Jingwen, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. "Bidirectional attentive fusion with context gating for dense video captioning." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 7190-7198.

Mun, Jonghwan, Linjie Yang, Zhou Ren, Ning Xu, and Bohyung Han. "Streamlined dense video captioning." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, 6588-6597.

Krishna, Ranjay, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. "Dense-captioning events in videos." In Proceedings of the IEEE international conference on computer vision, 2017, 706-715.

Aafaq, Nayyer, Ajmal Mian, Naveed Akhtar, Wei Liu, and Mubarak Shah. "Dense video captioning with early linguistic information fusion." IEEE Transactions on Multimedia 25 (2022): 2309-2322.

Li, Yehao, Ting Yao, Yingwei Pan, Hongyang Chao, and Tao Mei. "Jointly localizing and describing events for dense video captioning." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 7492-7500.

Huang, Gabriel, Bo Pang, Zhenhai Zhu, Clara Rivera, and Radu Soricut. "Multimodal pretraining for dense video captioning." arXiv preprint arXiv:2011.11760 (2020).

Shen, Zhiqiang, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. "Weakly supervised dense video captioning." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, 1916-1924.

Xu, Huijuan, Boyang Li, Vasili Ramanishka, Leonid Sigal, and Kate Saenko. "Joint event detection and description in continuous video streams." In 2019 IEEE winter conference on applications of computer vision (WACV), IEEE, 2019, 396-405.

Zhang, Zhiwang, Dong Xu, Wanli Ouyang, and Luping Zhou. "Dense video captioning using graph-based sentence summarization." IEEE Transactions on Multimedia 23 (2020): 1799-1810.

Qi, Charles R., Hao Su, Kaichun Mo, and Leonidas J. Guibas. "Pointnet: Deep learning on point sets for 3d classification and segmentation." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, 652-660.

Zhou, Luowei, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong. "End-to-end dense video captioning with masked transformer." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, 8739-8748.

Hershey, Shawn, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal et al. "CNN architectures for large-scale audio classification." In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE, 2017, 131-135.

Iashin, Vladimir, and Esa Rahtu. "A better use of audio-visual cues: Dense video captioning with bi-modal transformer." arXiv preprint arXiv:2005.08271 (2020).

Yu, Zhou, and Nanjia Han. "Accelerated masked transformer for dense video captioning." Neurocomputing 445 (2021): 72-80.

Iashin, Vladimir, and Esa Rahtu. "Multi-modal dense video captioning." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, 958-959.

Wang, Teng, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, and Ping Luo. "End-to-end dense video captioning with parallel decoding." In Proceedings of the IEEE/CVF international conference on computer vision, 2021, 6847-6857.

Choi, Wangyu, Jiasi Chen, and Jongwon Yoon. "Parallel pathway dense video captioning with deformable transformer." IEEE Access 10 (2022): 129899-129910.

Rahman, Tanzila, Bicheng Xu, and Leonid Sigal. "Watch, listen and tell: Multi-modal weakly supervised dense event captioning." In Proceedings of the IEEE/CVF international conference on computer vision, 2019, 8908-8917.

Deng, Chaorui, Shizhe Chen, Da Chen, Yuan He, and Qi Wu. "Sketch, ground, and refine: Top-down dense video captioning." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, 234-243.

Choi, Wangyu, Jiasi Chen, and Jongwon Yoon. "Step by step: A gradual approach for dense video captioning." IEEE Access 11 (2023): 51949-51959.

Yang, Antoine, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. "Vid2seq: Large-scale pretraining of a visual language model for dense video captioning." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, 10714-10726.

Wei, Yiwei, Shaozu Yuan, Meng Chen, Xin Shen, Longbiao Wang, Lei Shen, and Zhiling Yan. "MPP-net: multi-perspective perception network for dense video captioning." Neurocomputing 552 (2023): 126523.

Zhou, Xingyi, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, and Cordelia Schmid. "Streaming dense video captioning." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, 18243-18252.

Kim, Minkuk, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, and Seong Tae Kim. "Do you remember? dense video captioning with cross-modal memory retrieval." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, 13894-13904.

Wu, Hao, Huabin Liu, Yu Qiao, and Xiao Sun. "DIBS: Enhancing dense video captioning with unlabeled videos via pseudo boundary enrichment and online refinement." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, 18699-18708.

Duan, Xuguang, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. "Weakly supervised dense event captioning in videos." Advances in Neural Information Processing Systems 31 (2018).

Aytar, Yusuf, Carl Vondrick, and Antonio Torralba. "Soundnet: Learning sound representations from unlabeled video." Advances in neural information processing systems 29 (2016).

Estevam, Valter, Rayson Laroca, Helio Pedrini, and David Menotti. "Dense video captioning using unsupervised semantic information." arXiv preprint arXiv:2112.08455 (2021).

Ji, Shuiwang, Wei Xu, Ming Yang, and Kai Yu. "3D convolutional neural networks for human action recognition." IEEE transactions on pattern analysis and machine intelligence 35, no. 1 (2012): 221-231.

Song, Yuqing, Shizhe Chen, Yida Zhao, and Qin Jin. "Team ruc_aim3 technical report at activitynet 2020 task 2: Exploring sequential events detection for dense video captioning." arXiv preprint arXiv:2006.07896 (2020).

Wang, Teng, Huicheng Zheng, Mingjing Yu, Qian Tian, and Haifeng Hu. "Event-centric hierarchical representation for dense video captioning." IEEE Transactions on Circuits and Systems for Video Technology 31, no. 5 (2020): 1890-1900.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. "Bleu: a method for automatic evaluation of machine translation." In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, 311-318.

Banerjee, Satanjeev, and Alon Lavie. "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments." In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005, 65-72.

Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. "Cider: Consensus-based image description evaluation." In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, 4566-4575.

Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text summarization branches out, 2004, 74-81.

Fujita, Soichiro, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, and Masaaki Nagata. "Soda: Story oriented dense video captioning evaluation framework." In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VI 16, Springer International Publishing, 2020, 517-531.