Detection Of A New Class In A Huge Corpus Of Text Documents Through Semi-Supervised Learning

2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016

Abstract
This paper poses a new problem: detecting an unknown class in a text corpus that has a huge number of unlabeled samples but only a very small number of labeled samples. A simple yet efficient solution is proposed by modifying a conventional clustering technique, to demonstrate the scope of the problem for further research. A novel way to estimate cluster diameter is proposed, and this estimate is in turn used to measure the degree of dissimilarity between two clusters. The main idea of the model is to arrive at a cluster of unlabeled text samples that is far away from every labeled cluster, guided by a few rules based on the diameter of a cluster and the dissimilarity between pairs of clusters. This work is the first of its kind in the literature and has many applications in text mining tasks. In fact, the proposed model is a general framework that can be applied to any application involving the identification of unseen classes in a semi-supervised learning environment. The model has been studied through extensive empirical analysis on different text datasets created from the benchmark 20Newsgroups dataset. The experimental results reveal the capabilities of the proposed approach and the possibilities for future research.
Keywords
Text Categorization, Semi-Supervised Learning, Clustering, Unknown Class Detection, Text Representation, Term Weighting
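
The abstract's main idea, flagging a cluster of unlabeled samples that lies far from every labeled cluster, can be illustrated with a minimal Python sketch. The abstract does not give the paper's diameter estimate or its dissimilarity rule, so everything below is an assumption made for illustration: the diameter is approximated as twice the mean distance to the centroid, the dissimilarity is the centroid gap divided by the average of the two diameters, and the function names, the `threshold` value, and the toy 2-D vectors are hypothetical. Real documents would be represented by term-weight vectors instead.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def diameter(points):
    # ASSUMPTION: stand-in estimate (twice the mean distance to the centroid),
    # not the estimator proposed in the paper.
    c = centroid(points)
    return 2.0 * sum(euclidean(p, c) for p in points) / len(points)

def dissimilarity(cluster_a, cluster_b):
    # ASSUMPTION: centroid gap relative to the average of the two diameters.
    gap = euclidean(centroid(cluster_a), centroid(cluster_b))
    spread = (diameter(cluster_a) + diameter(cluster_b)) / 2.0
    if spread == 0.0:
        return math.inf if gap > 0.0 else 0.0
    return gap / spread

def is_candidate_new_class(unlabeled_cluster, labeled_clusters, threshold=2.0):
    """True when the unlabeled cluster is dissimilar to every labeled cluster."""
    return all(
        dissimilarity(unlabeled_cluster, points) > threshold
        for points in labeled_clusters.values()
    )

# Toy data: two labeled clusters and two clusters of unlabeled samples.
labeled = {
    "class_a": [[0, 0], [0, 1], [1, 0], [1, 1]],
    "class_b": [[10, 0], [10, 1], [11, 0], [11, 1]],
}
far_cluster = [[0, 10], [0, 11], [1, 10], [1, 11]]
near_cluster = [[0.2, 0.2], [0.8, 0.8], [0.2, 0.8], [0.8, 0.2]]

far_is_new = is_candidate_new_class(far_cluster, labeled)
near_is_new = is_candidate_new_class(near_cluster, labeled)
```

In this toy setup `far_cluster` sits well away from both labeled clusters and is flagged as a candidate new class, while `near_cluster` overlaps `class_a` and is not. Normalizing the centroid gap by cluster spread is one design choice among several; it keeps the threshold independent of the overall scale of the vectors.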