Published on Mon Jun 04 2018

Topic Modelling of Empirical Text Corpora: Validity, Reliability, and Reproducibility in Comparison to Semantic Maps

Tobias Hecking, Loet Leydesdorff


Abstract

Using the 6,638 case descriptions of societal impact submitted for evaluation in the Research Excellence Framework (REF 2014), we replicate the topic model (Latent Dirichlet Allocation or LDA) made in this context and compare the results with factor-analytic results using a traditional word-document matrix (Principal Component Analysis or PCA). Removing a small fraction of documents from the sample, for example, has on average a much larger impact on LDA than on PCA-based models, to the extent that the largest distortion in the case of PCA has less effect than the smallest distortion of LDA-based models. In terms of semantic coherence, however, LDA models outperform PCA-based models. The topic models inform us about the statistical properties of the document sets under study, but the results are statistical and should not be used for semantic interpretation (for example, in grant selections, micro-decision making, or scholarly work) without follow-up using domain-specific semantic maps.
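
The document-removal check described in the abstract can be illustrated with a minimal sketch. It covers only the PCA side, on a hypothetical toy word-document matrix, and is not the authors' procedure: the matrix `TOY_COUNTS`, the function names, and the use of absolute cosine similarity between leading components as a stability score are all assumptions made for illustration. LDA is not implemented here.

```python
import math

# Hypothetical toy document-by-word count matrix (rows: documents, columns: words).
# It is illustrative only and unrelated to the REF 2014 data.
TOY_COUNTS = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [3, 1, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
]


def center_columns(matrix):
    """Subtract each word's mean count so PCA sees deviations, not raw counts."""
    n_docs = len(matrix)
    means = [sum(column) / n_docs for column in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]


def first_component(matrix, iterations=200):
    """Approximate the leading principal axis (unit vector) by power iteration."""
    centered = center_columns(matrix)
    n_words = len(centered[0])
    # Non-uniform start vector: a uniform one can be orthogonal to the leading
    # axis when all documents have the same length.
    vector = [float(j + 1) for j in range(n_words)]
    for _ in range(iterations):
        # Multiply by X^T X without forming the covariance matrix explicitly.
        scores = [sum(a * b for a, b in zip(row, vector)) for row in centered]
        product = [
            sum(score * row[j] for score, row in zip(scores, centered))
            for j in range(n_words)
        ]
        norm = math.sqrt(sum(c * c for c in product))
        vector = [c / norm for c in product]
    return vector


def leave_one_out_similarities(matrix):
    """Remove each document in turn and compare the leading axis with the full-sample one."""
    full = first_component(matrix)
    similarities = []
    for i in range(len(matrix)):
        reduced = matrix[:i] + matrix[i + 1:]
        component = first_component(reduced)
        # Absolute cosine similarity: the sign of a principal axis is arbitrary.
        similarities.append(abs(sum(a * b for a, b in zip(full, component))))
    return similarities


if __name__ == "__main__":
    sims = leave_one_out_similarities(TOY_COUNTS)
    print("smallest similarity after removing one document:", min(sims))
    print("mean similarity:", sum(sims) / len(sims))
```

Values close to 1 indicate that the leading component barely moves when a document is dropped. A comparable check for LDA would refit the topic model on each reduced sample and match topics across runs before comparing them, which requires a topic-modelling library and a matching rule not shown here.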

Sat Oct 17 2015
Machine Learning
A Historical Analysis of the Field of OR/MS using Topic Models
The study is based on 80,757 published journal abstracts from 37 leading OR/MS journals. We have developed a topic model, using Latent Dirichlet Allocation (LDA), and extend this analysis to reveal the temporal dynamics of the field, journals, and topics.
Mon Apr 13 2020
NLP
Keyword Assisted Topic Models
Keyword assisted topic model (keyATM) provides more interpretable results. It is less sensitive to the number of topics than standard topic models. An open-source software package is available for implementing the proposed methodology.
Thu Jan 12 2017
Machine Learning
Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling
Latent Dirichlet Allocation (LDA) models trained without stopword removal often produce topics with high posterior probabilities on uninformative words. We propose an additional topic quality metric that targets the stopword problem. We also propose a simple-to-implement strategy for generating topics of higher quality.
Tue Apr 06 2021
Machine Learning
Exploring Topic-Metadata Relationships with the STM: A Bayesian Approach
Wed Feb 25 2015
Machine Learning
Topic-adjusted visibility metric for scientific articles
Measuring the impact of scientific articles is important for evaluating the research output of individual scientists, academic institutions and journals. While citations are raw data for constructing impact measures, there exist potential issues if factors affecting citation patterns are not properly accounted for. In this work, we address the problem of field
Thu Mar 29 2018
NLP
Computer-Assisted Text Analysis for Social Science: Topic Models and Beyond
Topic models are a family of statistics-based algorithms to summarize, explore, and index large collections of text documents. After a decade of research led by computer scientists, topic models have spread to social science.