Journal of Innovative Image Processing is accepted for inclusion in Scopus.

Volume 7 | Issue 2 | June 2025

Improving Narrative Coherence in Dense Video Captioning through Transformer and Large Language Models
Open Access
Dvijesh Bhatt, Priyank Thakkar
Pages: 333-361
Cite this article
Bhatt, Dvijesh, and Priyank Thakkar. "Improving Narrative Coherence in Dense Video Captioning through Transformer and Large Language Models." Journal of Innovative Image Processing 7, no. 2 (2025): 333-361
Published
03 June, 2025
Abstract

Dense video captioning aims to identify events within a video and generate a natural language description for each one. Most existing approaches follow a two-stage framework consisting of an event proposal module and a caption generation module. Previous methods have predominantly used convolutional neural networks and sequential models to describe individual events in isolation. This limits the influence of neighboring events on the caption for a given segment, often yielding descriptions that lack coherence with the broader storyline of the video. To address this limitation, we propose a captioning module that combines a Transformer architecture with a Large Language Model (LLM). A convolutional and LSTM-based proposal module detects and localizes events within the video. An encoder-decoder Transformer model then generates an initial caption for each proposed event. Finally, the LLM takes the set of individually generated event captions as input and produces a coherent, multi-sentence summary. This summary captures cross-event dependencies and provides a contextually unified and narratively rich description of the entire video. Extensive experiments on the ActivityNet dataset show that the proposed model, Transformer-LLM based Dense Video Captioning (TL-DVC), raises the METEOR score from 11.28 to 12.32, a 9.22% relative improvement over state-of-the-art models.
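The reported gain is a relative one, and it can be checked directly from the two METEOR scores quoted in the abstract. The short sketch below does only that arithmetic; the helper name `relative_improvement` is an illustrative choice and is not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Both scores are the METEOR values quoted in the abstract.
baseline_meteor = 11.28  # best prior model
tl_dvc_meteor = 12.32    # proposed TL-DVC model

gain = relative_improvement(baseline_meteor, tl_dvc_meteor)
print(f"Relative METEOR improvement: {gain:.2f}%")
```

The absolute difference is 1.04 METEOR points; dividing by the baseline of 11.28 gives the 9.22% figure.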

Keywords

Convolution 3D, Dense Video Caption, LSTM, Transformer, Encoder-Decoder, LLM
