Using Wikipedia for Co-clustering Based Cross-Domain Text Classification

Pisa(2008)

Abstract
Traditional approaches to document classification require labeled data in order to construct reliable and accurate classifiers. Unfortunately, labeled data are seldom available and are often too expensive to obtain. Even when training data are not available for a given learning task, abundant labeled data may exist for a different but related domain. One would like to use the related labeled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has previously been proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway along which labels are propagated between the two domains captures not only common words but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.
Keywords
wikipedia, statistical distributions, pattern clustering, classification algorithm, learning (artificial intelligence), co-clustering, document classification, transfer learning, auxiliary information, learning strategy, web sites, cross-domain text classification, effective learning strategy, probability distribution, classification, text analysis, co-clustering-based cross-domain text classification, different probability distribution, auxiliary data, classification task, latent semantic relationship, training data, internet, co clustering, data mining, electronic publishing, information services, encyclopedias, learning artificial intelligence, clustering algorithms