Abstract
In Sanskrit, sandhi is the procedure by which short words (morphemes) are joined to create compound words. Sandhi splitting is the reverse process: a compound word is broken down into its component morphemes. This study surveys the techniques and methodologies used to perform sandhi splitting on Sanskrit sentences, as identified through a literature survey. The initial approaches used Finite State Transducers. Early efforts to increase accuracy drew on mathematical models and various optimality theories, and graph-based and parser-based techniques were introduced later. With the advancement of deep learning, Recurrent Neural Networks, Long Short-Term Memory (LSTM) models, and double-decoder models were adopted, in which neural networks and classifiers are trained on the task. The most recent methodologies are bidirectional LSTM models with an attention mechanism, transformer-based models, and language models such as BERT, which proved to offer higher accuracy and performance.
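To make the two operations concrete, the following is a minimal sketch of sandhi joining and lexicon-directed splitting. It is an illustration under stated assumptions, not any of the surveyed systems: the four vowel rules are a tiny hand-picked subset written in IAST transliteration, and the names `RULES`, `join`, and `split` are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalize to NFC so that long vowels such as 'ā' are single characters."""
    return unicodedata.normalize("NFC", text)


# Toy vowel-sandhi rules in IAST (an illustrative subset, not a full grammar):
# (final sound of left word, initial sound of right word) -> fused sound.
_RAW_RULES = {
    ("a", "a"): "ā",
    ("a", "ā"): "ā",
    ("a", "i"): "e",
    ("a", "u"): "o",
}
RULES = {(nfc(l), nfc(r)): nfc(f) for (l, r), f in _RAW_RULES.items()}


def join(left, right):
    """Join two words, applying the first matching sandhi rule at the boundary."""
    left, right = nfc(left), nfc(right)
    for (l_end, r_start), fused in RULES.items():
        if left.endswith(l_end) and right.startswith(r_start):
            return left[: -len(l_end)] + fused + right[len(r_start):]
    return left + right  # no rule applies: plain concatenation


def split(word, lexicon):
    """Return every two-way split that undoes a rule and yields two lexicon words."""
    word = nfc(word)
    known = {nfc(entry) for entry in lexicon}
    candidates = []
    for i in range(len(word)):
        for (l_end, r_start), fused in RULES.items():
            if word.startswith(fused, i):
                left = word[:i] + l_end
                right = r_start + word[i + len(fused):]
                if left in known and right in known:
                    candidates.append((left, right))
    return candidates


print(join("deva", "indra"))                  # devendra
print(split("devendra", {"deva", "indra"}))   # [('deva', 'indra')]
```

Because one fused sound can come from several rule pairs (here both a + a and a + ā give ā), `split` may return more than one candidate for the same word. Resolving that ambiguity is the part that the statistical and neural approaches in this survey address.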
