Implications of Tokenizers in BERT Model for Low-Resource Indian Language

N. Venkatesan; N. Arulanand

doi:10.36548/jscp.2022.4.005

Open Access

https://doi.org/10.36548/jscp.2022.4.005

Vol. 4, No. 4 (2022)

Published: 18 January, 2023

Pages: 264-271

N. Venkatesan

Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India

N. Arulanand

Professor, Department of Computer Science and Engineering, PSG College of Technology, Coimbatore, India

How to Cite

Venkatesan, N., and N. Arulanand. 2023. “Implications of Tokenizers in BERT Model for Low-Resource Indian Language”. Journal of Soft Computing Paradigm 4 (4): 264-71. https://doi.org/10.36548/jscp.2022.4.005.

Keywords

Tokenization

WordPiece

Byte Pair Encoding (BPE)

Unigram

WordLevel

low resource language

Tamil

BERT

Abstract

Tokenization is the text-preparation step that produces the initial tokens for any deep learning language model. De facto standard models such as BERT and GPT use WordPiece and Byte Pair Encoding (BPE), respectively, as their tokenization approaches. Tokenization may have a distinct impact on models for low-resource languages, such as the South Indian Dravidian languages, where many words are formed by adding prefixes and suffixes. In this paper, four tokenizers are compared at various levels of granularity, i.e., their outputs range from the smallest units, individual letters, to whole words in their most basic form. The tokenizers and the language models are trained on Tamil text using the BERT pretraining procedure. The model is then fine-tuned, with a number of parameters adjusted for improved performance, on a downstream Tamil text categorization task. A custom-built tokenizer for Tamil text is created and trained with the BPE, WordPiece, Unigram, and WordLevel mechanisms, and the results are compared after the downstream Tamil text categorization task is performed using the BERT model.
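
The range of granularity described in the abstract can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the toy corpus, the function names (`train_bpe`, `bpe_tokenize`, `char_tokenize`, `word_tokenize`), and the merge count are assumptions made here for illustration. It contrasts character-level, BPE, and word-level tokenization on a few inflected Tamil word forms; WordPiece and Unigram are omitted for brevity.

```python
from collections import Counter

# Illustrative toy corpus (an assumption, not the paper's data): a few Tamil
# nouns with plural and locative suffixes attached.
CORPUS = "வீடு வீடுகள் வீடுகளில் மரம் மரங்கள் மரங்களில்"


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges by repeatedly merging the most frequent pair."""
    # Each word starts as a tuple of Unicode code points (not grapheme
    # clusters, so Tamil vowel signs begin as separate symbols).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def bpe_tokenize(word, merges):
    """Segment one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def char_tokenize(text):
    """Finest granularity: one token per code point, whitespace dropped."""
    return [ch for word in text.split() for ch in word]


def word_tokenize(text):
    """Coarsest granularity: one token per whitespace-separated word."""
    return text.split()


merges = train_bpe(CORPUS.split(), num_merges=10)
bpe_tokens = [tok for w in CORPUS.split() for tok in bpe_tokenize(w, merges)]

print("character-level tokens:", len(char_tokenize(CORPUS)))
print("BPE tokens:            ", len(bpe_tokens))
print("word-level tokens:     ", len(word_tokenize(CORPUS)))
```

With more merges the BPE segmentation moves from letter-like pieces toward whole words, which is the granularity trade-off the paper examines. A production setup would more likely train these tokenizers with an established library than with hand-written code like this.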

