Sense Unveiled: Enhancing Urdu Corpus for Nuanced Word Sense Disambiguation

Sarfraz Bibi, Sohail Asghar, Muhammad Zubair

IEEE Access (2024)

Abstract

Ambiguity in word meanings is a significant challenge in natural language processing and calls for robust Word Sense Disambiguation (WSD) techniques. WSD research has focused predominantly on widely spoken languages such as English and Spanish, with far less attention given to languages such as Urdu. This paper addresses that gap by examining the existing corpora for Urdu WSD and presenting an Enhanced Urdu (EU) corpus built specifically for WSD tasks. The analysis critically evaluates the limitations of the ULS-WSD-18 Corpus and justifies the need for a more comprehensive resource. The EU corpus comprises 960 words, categorized by their frequency in the corpus as most frequent, moderately frequent, or infrequent. These words form the basis for the sentences used in model training and testing. Several similarity coefficients are used to compare the EU corpus with the ULS-WSD-18 Corpus, revealing notable patterns in word occurrences, sense structures, and sentence compositions. The findings underscore the potential of the EU corpus to advance WSD research for Urdu. By providing a comprehensive resource for model development and evaluation, this work contributes to the broader goal of improving language processing tools for Urdu and other underrepresented languages.
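
The abstract does not name the similarity coefficients used. As a minimal sketch of what a set-based corpus comparison might look like, the Python below computes three common coefficients (Jaccard, Dice, and overlap) over two vocabularies. The choice of coefficients, the function names, and the placeholder word sets are all assumptions for illustration and are not taken from the paper.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical placeholder vocabularies; real input would be the sets of
# target words (or senses, or sentences) drawn from each corpus.
corpus_a = {"w1", "w2", "w3", "w4"}
corpus_b = {"w3", "w4", "w5"}

for name, fn in [("jaccard", jaccard), ("dice", dice), ("overlap", overlap)]:
    print(f"{name}: {fn(corpus_a, corpus_b):.3f}")
```

The same functions could be applied at each level the abstract mentions (words, senses, sentences) by changing what the sets contain.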

Key words

Word sense disambiguation, natural language processing, machine learning, sense-tagged Urdu corpus
