Volume - 7 | Issue - 2 | June 2025
Published: 03 June 2025
Dense video captioning aims to identify events within a video and generate natural language descriptions for each event. Most existing approaches adhere to a two-stage framework consisting of an event proposal module and a caption generation module. Previous methodologies have predominantly employed convolutional neural networks and sequential models to describe individual events in isolation. However, these methods limit the influence of neighboring events when generating captions for a specific segment, often resulting in descriptions that lack coherence with the broader storyline of the video. To address this limitation, we propose a captioning module that leverages both a Transformer architecture and a Large Language Model (LLM). A convolutional and LSTM-based proposal module is used to detect and localize events within the video. An encoder-decoder-based Transformer model generates an initial caption for each proposed event. Additionally, we introduce an LLM that takes the set of individually generated event captions as input and produces a coherent, multi-sentence summary. This summary captures cross-event dependencies and provides a contextually unified and narratively rich description of the entire video. Extensive experiments on the ActivityNet dataset demonstrate that the proposed model, Transformer-LLM-based Dense Video Captioning (TL-DVC), achieves a 9.22% improvement over state-of-the-art models, increasing the METEOR score from 11.28 to 12.32.
Keywords: Convolution 3D, Dense Video Caption, LSTM, Transformer Encoder-Decoder, LLM
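To make the described three-stage pipeline concrete, the sketch below outlines how a Conv3D-plus-LSTM proposal module, a Transformer encoder-decoder captioner, and an LLM-based fusion step could fit together. It is a minimal illustration only: all module names, dimensions, and the `llm_generate` callable are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of a TL-DVC-style pipeline (all names, dimensions, and
# hyperparameters are assumptions for exposition, not the authors' code).
import torch
import torch.nn as nn


class EventProposalModule(nn.Module):
    """Conv3D features followed by an LSTM over time to score candidate events."""
    def __init__(self, hidden=256):
        super().__init__()
        self.conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))        # keep temporal axis only
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)                     # per-step "eventness"

    def forward(self, clip):                                  # clip: (B, 3, T, H, W)
        feat = self.pool(torch.relu(self.conv3d(clip)))       # (B, 64, T, 1, 1)
        feat = feat.squeeze(-1).squeeze(-1).transpose(1, 2)   # (B, T, 64)
        states, _ = self.lstm(feat)
        return torch.sigmoid(self.score(states))              # (B, T, 1) event scores


class EventCaptioner(nn.Module):
    """Encoder-decoder Transformer that captions one proposed event segment."""
    def __init__(self, vocab_size=10000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(64, d_model)                    # project visual features
        self.transformer = nn.Transformer(d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, event_feats, caption_tokens):           # (B, T_e, 64), (B, L)
        memory_in = self.proj(event_feats)
        decoded = self.transformer(memory_in, self.embed(caption_tokens))
        return self.out(decoded)                              # (B, L, vocab) logits


def fuse_captions_with_llm(event_captions, llm_generate):
    """Merge per-event captions into one coherent multi-sentence video summary.
    `llm_generate` is a hypothetical text-generation callable (prompt -> text)."""
    prompt = ("Combine the following event captions into a single coherent "
              "paragraph describing the whole video:\n"
              + "\n".join(f"- {c}" for c in event_captions))
    return llm_generate(prompt)
```

In this reading, the proposal scores select event segments, each segment is captioned independently by the Transformer, and only the final LLM step sees all captions at once, which is where the cross-event dependencies described in the abstract would be captured.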