Volume - 7 | Issue - 3 | September 2025
Published
23 September, 2025
Natural scene text detection and language identification is a challenging problem in computer vision, with applications in autonomous video surveillance and in the design of OCR systems for natural scene images. Autonomous video surveillance and monolingual OCR systems do not work efficiently on natural scene images, where text appears in different orientations, backgrounds, and lighting conditions and in multilingual scripts. Hence, we propose a deep learning model, a fine-tuned YOLOv5, for text detection and language identification in bilingual scene images. No standard ground-truth database for testing the proposed model exists in the literature. Therefore, we created our own real-time natural scene dataset from the Kalaburagi and Bidar districts in the state of Karnataka. The proposed model is obtained by training YOLOv5 on this dataset, and it uses a genetic approach to produce the anchor boxes for the objects present in a natural scene image. To assess the performance of the fine-tuned YOLOv5 model, we employed the evaluation metrics precision, recall, F1 score, and accuracy. The dataset was split into 80% for training and 20% for testing. The experimental results demonstrate the robustness of the fine-tuned YOLOv5 model for text detection and language identification: we obtained a precision of 86.8%, a recall of 83.4%, an F1 score of 85%, and an accuracy of 94.4%. A comparative analysis against existing methods in the literature shows that the fine-tuned YOLOv5 model performs better. The novelty of the paper lies in the fine-tuned YOLOv5 model and in a dataset that comprises a mixture of low-resolution and complex-background images.
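The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The following is a minimal sketch, not the authors' evaluation code: the function names are hypothetical, and the count-based helper assumes detections have already been matched to ground truth to give true positives, false positives, and false negatives.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from matched detection counts.

    tp, fp, fn are assumed inputs; how detections are matched to
    ground truth (e.g. the IoU threshold) is not specified here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Values reported in the abstract: precision 86.8%, recall 83.4%.
reported_f1 = f1_from_pr(0.868, 0.834)
print(f"F1 = {reported_f1:.3f}")
```

With the reported precision and recall, the harmonic mean rounds to 0.85, which agrees with the stated F1 score of 85%.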
Keywords: YOLOv5, SPPF, Deep Learning, Computer Vision, Image Processing