IRO Journals

Journal of Soft Computing Paradigm

Volume - 4 | Issue - 4 | December 2022

Implications of Tokenizers in BERT Model for Low-Resource Indian Language
N. Venkatesan, N. Arulanand
Pages: 264-271
Cite this article
Venkatesan, N. & Arulanand, N. (2022). Implications of Tokenizers in BERT Model for Low-Resource Indian Language. Journal of Soft Computing Paradigm, 4(4), 264-271. doi:10.36548/jscp.2022.4.005
Published
18 January, 2023
Abstract

Tokenization is the text-preparation step that produces the initial tokens for any deep learning language model. Prominent models such as BERT and GPT use WordPiece and Byte Pair Encoding (BPE), respectively, as their de facto tokenization approaches. Tokenization may have a distinct impact on models for low-resource languages, such as the south Indian Dravidian languages, in which many words are formed by adding prefixes and suffixes. In this paper, four tokenizers are compared at different granularity levels, with outputs ranging from individual characters to whole words in their base form. The tokenizers and the corresponding language models are trained on Tamil text using the BERT pretraining process. Each model is then fine-tuned, with several parameters adjusted to improve performance, on a downstream Tamil text categorization task. The custom tokenizers for Tamil text are built and trained with the BPE, WordPiece, Unigram, and WordLevel mechanisms, and the results are compared after the downstream Tamil text categorization task is performed with BERT.

Keywords

Tokenization; WordPiece; Byte Pair Encoding (BPE); Unigram; WordLevel; low-resource language; Tamil; BERT
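
The granularity range described in the abstract, from single characters up to whole words, can be illustrated with a toy BPE trainer. The following is a minimal sketch in plain Python and not the authors' implementation: the function names (`learn_bpe`, `merge_pair`, `tokenize`), the tie-breaking rule, and the small English stand-in corpus are all assumptions made for illustration, and the paper's Tamil corpus is not reproduced here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Start at the finest granularity: each word is a tuple of characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken lexicographically
        # (an assumption made here so that the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical stand-in corpus: shared stems with different suffixes.
corpus = ["walk", "walked", "walking", "talk", "talked", "talking"]

if __name__ == "__main__":
    for budget in (0, 3, 10):
        rules = learn_bpe(corpus, budget)
        print(budget, tokenize("walking", rules))
```

With a merge budget of zero the output is character-level; as the budget grows, frequent stems and suffixes fuse into larger subword units, approaching word-level tokens. WordPiece and Unigram choose their vocabularies by different criteria, so this sketch speaks only to the BPE case.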
