Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach

November 29, 2018 · Declared Dead · 🏛 Pacific-Asia Conference on Knowledge Discovery and Data Mining

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Tiehang Duan, Qi Lou, Sargur N. Srihari, Xiaohui Xie
arXiv ID: 1811.12500
Category: cs.LG (Machine Learning)
Cross-listed: cs.IR, stat.ML
Citations: 7
Venue: Pacific-Asia Conference on Knowledge Discovery and Data Mining
Last Checked: 3 months ago

Abstract

Current state-of-the-art nonparametric Bayesian text clustering methods model documents with a multinomial distribution over bags of words. Although these methods can effectively exploit the word burstiness of documents and achieve decent performance, they do not use the sequential information in text or the relationships among synonyms. In this paper, documents are modeled jointly by bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to exploit this joint document representation in text clustering. The sequential features are extracted by an encoder-decoder component. Word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major respects: 1) improved performance across multiple diverse text datasets in terms of normalized mutual information (NMI); 2) more accurate inference of the ground-truth number of clusters, with a regularizing effect on tiny outlier clusters.
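
The abstract reports clustering quality as normalized mutual information (NMI). As a rough illustration of that metric only, here is a minimal sketch in plain Python. The abstract does not say which normalization the authors use, so the arithmetic-mean form 2·I(U;V) / (H(U) + H(V)) is an assumption, and the function name `normalized_mutual_information` is hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_true, labels_pred):
    """NMI between two labelings of the same documents.

    Assumption: arithmetic-mean normalization, 2*I(U;V) / (H(U) + H(V)).
    The paper's abstract does not specify the variant it reports.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label sequences must be non-empty")

    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical cluster distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    mutual_info = 0.0
    for (u, v), c in count_joint.items():
        # p(u, v) * log( p(u, v) / (p(u) * p(v)) ), written with raw counts.
        mutual_info += (c / n) * math.log(c * n / (count_true[u] * count_pred[v]))

    if h_true + h_pred == 0.0:
        # Both labelings put every document in one cluster; treat as identical.
        return 1.0
    return 2.0 * mutual_info / (h_true + h_pred)
```

Because NMI depends only on how documents are grouped, it is unchanged when cluster IDs are relabeled, which is why it suits models such as a Dirichlet process mixture whose cluster indices are arbitrary and whose number of clusters is inferred rather than fixed.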