Diversifying Restricted Boltzmann Machine For Document Modeling
KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Sydney NSW Australia August, 2015(2015)
摘要
Restricted Boltzmann Machine (RBM) has shown great effectiveness in document modeling. It utilizes hidden units to discover the latent topics and can learn compact semantic representations for documents which greatly facilitate document retrieval, clustering and classification. The popularity (or frequency) of topics in text corpora usually follow a power-law distribution where a few dominant topics occur very frequently while most topics (in the long-tail region) have low probabilities. Due to this imbalance, RBM tends to learn multiple redundant hidden units to best represent dominant topics and ignore those in the long-tail region, which renders the learned representations to be redundant and non-informative. To solve this problem, we propose Diversified RBM (DRBM) which diversifies the hidden units, to make them cover not only the dominant topics, but also those in the long-tail region. We define a diversity metric and use it as a regularizer to encourage the hidden units to be diverse. Since the diversity metric is hard to optimize directly, we instead optimize its lower bound and prove that maximizing the lower bound with projected gradient ascent can increase this diversity metric. Experiments on document retrieval and clustering demonstrate that with diversification, the document modeling power of DRBM can be greatly improved.
更多查看译文
关键词
Diversified Restricted Boltzmann Machine,Diversity,Power-law Distribution,Document Modeling,Topic Modeling
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络