Nepali Image Captioning: Generating Coherent Paragraph-Length Descriptions Using Transformer

How to Cite

Subedi, Nabaraj, Nirajan Paudel, Manish Chhetri, Sudarshan Acharya, and Nabin Lamichhane. 2024. “Nepali Image Captioning: Generating Coherent Paragraph-Length Descriptions Using Transformer”. Journal of Soft Computing Paradigm 6 (1): 70-84. https://doi.org/10.36548/jscp.2024.1.006.

Keywords

— BLEU
— Inception V3
— Nepali Captions
— Transformer
Published: 30-04-2024

Abstract

The advent of deep neural networks has made image captioning, the task of generating text that describes the different parts of an image, far more feasible. Most work on this task has targeted English, while comparatively little effort has gone into other languages, particularly Nepali. Research in Nepali is made harder still by the language's complex grammatical structure and broad linguistic domain. Moreover, the little work that has been done in Nepali generates only single-sentence captions, whereas the proposed work emphasizes generating coherent, paragraph-length descriptions. The Stanford image-paragraph dataset, translated into Nepali using the Google Translate API, is used in the proposed work, together with a manually curated dataset of 800 images of the cultural sites of Nepal and their Nepali captions. These two datasets were combined to train the deep learning model, which is based on the Transformer architecture. Image features were extracted using a pretrained Inception V3 model and, after positional encoding, fed into the encoder; simultaneously, embedded tokens from the captions were fed into the decoder. The resulting captions were assessed using BLEU scores, showing high accuracy and BLEU scores on the test images.
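To make the described pipeline concrete, the following is a minimal TensorFlow/Keras sketch, not the authors' released code: a frozen, pretrained Inception V3 yields an 8x8 grid of image features that is flattened into 64 tokens, positionally encoded, and passed through a Transformer encoder, while embedded caption tokens pass through a masked Transformer decoder that cross-attends to the encoder output. All sizes here (D_MODEL, VOCAB, MAX_LEN, head count, single-layer depth) are illustrative assumptions, not the paper's reported hyperparameters.

```python
import numpy as np
import tensorflow as tf

D_MODEL = 512    # illustrative model width, not the paper's reported value
VOCAB = 20000    # placeholder Nepali (sub)word vocabulary size
MAX_LEN = 256    # placeholder maximum caption length

# 1. Frozen, pretrained Inception V3 without its classification head.
cnn = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
cnn.trainable = False
proj = tf.keras.layers.Dense(D_MODEL)  # project 2048-d CNN features to model width

def image_tokens(images):
    """images: (batch, 299, 299, 3) in [0, 255] -> (batch, 64, D_MODEL)."""
    x = tf.keras.applications.inception_v3.preprocess_input(images)
    f = cnn(x)                           # (batch, 8, 8, 2048) feature grid
    f = tf.reshape(f, (-1, 64, 2048))    # flatten the 8x8 grid into 64 tokens
    return proj(f)

# 2. Standard sinusoidal positional encoding (Vaswani et al., 2017).
def positional_encoding(length, depth):
    pos = np.arange(length)[:, None]
    i = np.arange(depth)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / depth)
    return tf.cast(np.where(i % 2 == 0, np.sin(angles), np.cos(angles)), tf.float32)

pos_table = positional_encoding(MAX_LEN, D_MODEL)

# 3. A single encoder layer over the positionally encoded image tokens.
enc_attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=D_MODEL // 8)
enc_ffn = tf.keras.Sequential([tf.keras.layers.Dense(4 * D_MODEL, activation="relu"),
                               tf.keras.layers.Dense(D_MODEL)])
enc_n1, enc_n2 = (tf.keras.layers.LayerNormalization() for _ in range(2))

def encode(feats):
    x = feats + pos_table[:64]                         # position-encode image tokens
    x = enc_n1(x + enc_attn(query=x, value=x, key=x))  # self-attention + residual
    return enc_n2(x + enc_ffn(x))

# 4. A single decoder layer: masked self-attention over embedded caption
#    tokens, cross-attention to the encoder output, then a vocabulary head.
embed = tf.keras.layers.Embedding(VOCAB, D_MODEL)
self_attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=D_MODEL // 8)
cross_attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=D_MODEL // 8)
dec_ffn = tf.keras.Sequential([tf.keras.layers.Dense(4 * D_MODEL, activation="relu"),
                               tf.keras.layers.Dense(D_MODEL)])
dec_n1, dec_n2, dec_n3 = (tf.keras.layers.LayerNormalization() for _ in range(3))
to_vocab = tf.keras.layers.Dense(VOCAB)

def decode(tokens, memory):
    """tokens: (batch, T) caption ids (static T assumed); memory: encoder output."""
    y = embed(tokens) + pos_table[: tokens.shape[1]]
    y = dec_n1(y + self_attn(query=y, value=y, key=y, use_causal_mask=True))
    y = dec_n2(y + cross_attn(query=y, value=memory, key=memory))
    return to_vocab(dec_n3(y + dec_ffn(y)))            # (batch, T, VOCAB) logits

# Forward pass with random data, just to check that the shapes line up.
imgs = tf.random.uniform((2, 299, 299, 3), 0, 255)
caps = tf.random.uniform((2, 20), 0, VOCAB, dtype=tf.int32)
logits = decode(caps, encode(image_tokens(imgs)))      # (2, 20, VOCAB)
```

At training time the decoder would see the caption shifted right and be optimized with token-level cross-entropy; at inference, the caption is generated autoregressively. The BLEU assessment mentioned above can be computed with, for example, NLTK; the tokenised Nepali captions below are made-up illustrations, not samples from the paper's datasets.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokenised captions, for illustration only.
reference = [["यो", "एउटा", "पुरानो", "मन्दिर", "हो"]]  # "this is an old temple"
candidate = ["यो", "एउटा", "मन्दिर", "हो"]              # model output to score
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```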

References

  1. J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei, “A hierarchical approach for generating descriptive image paragraphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 317–325.
  2. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  3. M. Ekman, Learning Deep Learning: Theory and Practice of Neural Networks, Computer Vision, Natural Language Processing, and Transformers Using TensorFlow. Addison-Wesley Professional, 2021.
  4. A. Adhikari and S. Ghimire, “Nepali Image Captioning,” in 2019 Artificial Intelligence for Transforming Business and Society (AITB), Kathmandu, Nepal: IEEE, Nov. 2019, pp. 1–6. doi: 10.1109/AITB48515.2019.8947436.
  5. R. Budhathoki and S. Timilsina, “Image Captioning in Nepali Using CNN and Transformer Decoder,” J. Eng. Sci., vol. 2, no. 1, pp. 41–48, Dec. 2023, doi: 10.3126/jes2.v2i1.60391.
  6. B. Subedi and B. K. Bal, “CNN-Transformer based Encoder-Decoder Model for Nepali Image Captioning,” in Proceedings of the 19th International Conference on Natural Language Processing (ICON), 2022, pp. 86–91.
  7. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2015, pp. 2048–2057.
  8. M. Chen, G. Ding, S. Zhao, H. Chen, Q. Liu, and J. Han, “Reference based LSTM for image captioning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017, pp. 3981–3987.
  9. A. S. Ami, M. Humaira, M. A. R. K. Jim, S. Paul, and F. M. Shah, “Bengali Image Captioning with Visual Attention,” in 2020 23rd International Conference on Computer and Information Technology (ICCIT), DHAKA, Bangladesh: IEEE, Dec. 2020, pp. 1–5. doi: 10.1109/ICCIT51783.2020.9392709.
  10. F. M. Shah, M. Humaira, M. A. R. K. Jim, A. S. Ami, and S. Paul, “Bornon: Bengali image captioning with transformer-based deep learning approach,” SN Computer Science, vol. 3, pp. 1–16, 2022.
  11. S. K. Mishra, R. Dhir, S. Saha, P. Bhattacharyya, and A. K. Singh, “Image captioning in Hindi language using transformer networks,” Comput. Electr. Eng., vol. 92, p. 107114, Jun. 2021, doi: 10.1016/j.compeleceng.2021.107114.
  12. T.-Y. Lin et al., “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds., Lecture Notes in Computer Science, vol. 8693, Cham: Springer International Publishing, 2014, pp. 740–755. doi: 10.1007/978-3-319-10602-1_48.
  13. Y. Bazi, L. Bashmal, M. M. A. Rahhal, R. A. Dayil, and N. A. Ajlan, “Vision Transformers for Remote Sensing Image Classification,” Remote Sens., vol. 13, no. 3, pp. 516–534, Feb. 2021, doi: 10.3390/rs13030516.
  14. R. Krishna et al., “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” Int. J. Comput. Vis., vol. 123, no. 1, pp. 32–73, May 2017, doi: 10.1007/s11263-016-0981-7.
  15. X. Shen, B. Liu, Y. Zhou, and J. Zhao, “Remote sensing image caption generation via transformer and reinforcement learning,” Multimed. Tools Appl., vol. 79, no. 35–36, pp. 26661–26682, Sep. 2020, doi: 10.1007/s11042-020-09294-7.
  16. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
  17. K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL ’02), Philadelphia, Pennsylvania: Association for Computational Linguistics, 2002, pp. 311–318. doi: 10.3115/1073083.1073135.