A network approach to topic models

August 04, 2017 Β· Entered Twilight Β· πŸ› Science Advances

πŸŒ… TWILIGHT: Old Age
Predates the code-sharing era β€” a pioneer of its time

"No code URL or promise found in abstract"
"Derived repo from GitHub Pages (backfill)"

Evidence collected by the PWNC Scanner

Repo contents: .gitignore, .gitmodules, .nojekyll, .travis.yml, LICENSE, MANIFEST.in, README.rst, appveyor.yml, ci_scripts, circle.yml, doc-environment.yml, doc, environment.yml, examples, readthedocs.yml, requirements.txt, setup.cfg, setup.py, topsbm

Authors Martin Gerlach, Tiago P. Peixoto, Eduardo G. Altmann arXiv ID 1708.01677 Category stat.ML: Machine Learning (Stat) Cross-listed cs.CL, physics.data-an, physics.soc-ph Citations 226 Venue Science Advances Repository https://github.com/topsbm/topsbm.github.io ⭐ 28 Last Checked 7 days ago
Abstract
One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success --- in particular of its most widely used variant called Latent Dirichlet Allocation (LDA) --- and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods -- using a stochastic block model (SBM) with non-parametric priors -- we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Machine Learning (Stat)

R.I.P. πŸ‘» Ghosted

Graph Attention Networks

Petar VeličkoviΔ‡, Guillem Cucurull, ... (+4 more)

stat.ML πŸ› ICLR πŸ“š 24.7K cites 8 years ago
R.I.P. πŸ‘» Ghosted

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

stat.ML πŸ› arXiv πŸ“š 12.0K cites 9 years ago