Volume - 7 | Issue - 4 | December 2025
Published: 02 December, 2025
Word sense disambiguation (WSD) is the task of determining the intended meaning of a word from its context, and it is a core problem in natural language processing. Earlier attempts at WSD for Gujarati produced poor results, largely because labeled datasets are scarce and the language is structurally complex, with idiomatic usage and subtle semantic shifts. As a result, previously reported models have shown low accuracy. To address this issue, we have created a new, manually sense-annotated dataset of ambiguous Gujarati words. The corpus covers 50 ambiguous words, each annotated with the sense appropriate to its context, which makes it a useful starting point for evaluating supervised learning models. Using this newly compiled corpus, we systematically compare two supervised machine learning algorithms, Decision Tree and Random Forest, under 3-fold and 5-fold cross-validation. Random Forest obtains the higher accuracy of the two, indicating that it is the better suited of these methods for this task. The main contributions of this work are a much-needed annotated corpus and evidence that supervised learning can be effective for Gujarati WSD when suitable annotated data is available.
Keywords: Machine Learning, Word Sense Disambiguation, Natural Language Processing, Decision Tree, Random Forest, Sense Annotated Corpus, Gujarati Language
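
The 3-fold and 5-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The sense labels are hypothetical, the folds are contiguous and unshuffled for simplicity, and a most-frequent-sense baseline stands in for the Decision Tree and Random Forest classifiers so that the example needs only the Python standard library.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def most_frequent_sense(train_labels):
    """Baseline stand-in for a classifier: the most common training sense."""
    return Counter(train_labels).most_common(1)[0][0]


def cross_validate(labels, k):
    """Mean accuracy of the baseline over k held-out folds."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        # Train on every example outside the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held]
        prediction = most_frequent_sense(train)
        correct = sum(1 for i in held_out if labels[i] == prediction)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical sense labels for occurrences of one ambiguous word.
labels = ["sense_a", "sense_b", "sense_a"] * 5

for k in (3, 5):
    print(f"{k}-fold mean accuracy: {cross_validate(labels, k):.3f}")
```

In an actual comparison, the baseline would be replaced by a trained classifier that predicts a sense from context features, with the same fold loop reused for each model so that their mean accuracies are directly comparable.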