What question did this study set out to answer?

The study aims to strengthen keyword-based text analysis by applying Confirmatory Factor Analysis (CFA) to keyword-based measures, so that their reliability and validity can be formally assessed.

April 16, 2026 · Open Access

From keyword-based text measures to latent variables: confirmatory factor analysis with word embeddings

Key Points

Employed Confirmatory Factor Analysis (CFA) on word embeddings to analyze text data.
Constructed a correlation matrix using cosine similarities between embedding vectors.
Evaluated factor loadings, model fit indices, and reliability coefficients.
Tested measurement invariance across different groups and time periods.
Conducted a Monte Carlo simulation to assess the behavior of fit indices with random keyword selection.
Found that applying CFA improved measurement reliability and internal structure of constructs measured by keywords.
Demonstrated measurement invariance across groups and time periods.
Showed that latent construct intensity can be compared across groups and time periods, turning keyword-based measures from descriptive indicators into formally comparable latent variables.
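
As a rough illustration of the correlation-matrix step listed above, the following Python sketch builds a keyword-by-keyword matrix of centered cosine similarities. It rests on an assumption not confirmed by the source: that "centered" means subtracting each embedding vector's own mean across its dimensions, which makes the cosine identical to a Pearson correlation. The paper may center differently (for example, by subtracting a corpus-wide mean vector). The function name `centered_cosine_matrix` and the random toy vectors are hypothetical stand-ins for real keyword embeddings.

```python
import numpy as np


def centered_cosine_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the keyword-by-keyword matrix of centered cosine similarities.

    embeddings: array of shape (n_keywords, n_dimensions).
    Assumption: centering subtracts each vector's own mean across dimensions.
    """
    # Center every keyword vector around its own mean.
    centered = embeddings - embeddings.mean(axis=1, keepdims=True)
    # Scale each centered vector to unit length.
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / norms
    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T


# Toy, randomly generated vectors standing in for real keyword embeddings.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(4, 50))
R = centered_cosine_matrix(toy_embeddings)
```

Under this assumption the result is symmetric with a unit diagonal, so it has the form of a correlation matrix that CFA software could take as input.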

Abstract

Dictionary-based text analysis, where researchers select keywords to measure constructs such as public sentiment, anxiety, or political attitudes in large text corpora, is widely used in computational social science. However, keyword selection is rarely subjected to the same psychometric scrutiny applied to survey instruments: studies seldom report reliability, evaluate internal structure, or test whether the measurement holds across subpopulations or time points. Moreover, few existing methods enable the construction of measures that reflect theoretical or expected relationships among keywords. This paper proposes a method that brings these capabilities to text analysis by applying Confirmatory Factor Analysis (CFA) to word embeddings. Keywords are treated as observed indicators of a latent construct, and their semantic relationships, operationalized as centered cosine similarities between embedding vectors, serve as the input correlation matrix for CFA estimation. The framework enables researchers to estimate factor loadings and model fit indices (CFI, TLI, RMSEA, SRMR), compute reliability coefficients (Cronbach’s alpha, Omega), and test measurement invariance across groups or time periods using multigroup models with structured means. The method also allows researchers to compare latent construct intensity across groups or time periods, transforming keyword-based text measures from descriptive indicators into formally comparable latent variables. The method is demonstrated through an empirical application to the discourse of war anxiety during Russia’s 2022 invasion of Ukraine. A Monte Carlo simulation further examines the behavior of fit indices under random keyword selection. The approach complements existing text analysis methods and can be implemented using standard software, such as the lavaan R package.
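
To make the reliability coefficients mentioned in the abstract more concrete, here is a minimal Python sketch of two textbook formulas: standardized Cronbach's alpha computed from a correlation matrix, and McDonald's omega for a one-factor model computed from standardized loadings. This is not the paper's implementation; the loadings are hypothetical illustration values, the function names are invented for this example, and in practice the loadings would come from a fitted CFA model (for instance in lavaan).

```python
import numpy as np


def standardized_alpha(corr: np.ndarray) -> float:
    """Standardized Cronbach's alpha from a k x k correlation matrix."""
    k = corr.shape[0]
    # Mean of the off-diagonal correlations.
    r_bar = corr[~np.eye(k, dtype=bool)].mean()
    return k * r_bar / (1 + (k - 1) * r_bar)


def omega_one_factor(loadings: np.ndarray) -> float:
    """McDonald's omega for one factor with standardized loadings
    and uncorrelated residuals."""
    common = loadings.sum() ** 2
    unique = (1 - loadings ** 2).sum()
    return common / (common + unique)


# Hypothetical standardized loadings for four keywords (illustration only).
loadings = np.array([0.8, 0.7, 0.6, 0.7])

# Correlation matrix implied by a one-factor model with these loadings.
implied = np.outer(loadings, loadings)
np.fill_diagonal(implied, 1.0)

alpha = standardized_alpha(implied)
omega = omega_one_factor(loadings)
```

When all loadings are equal the two coefficients coincide; with unequal loadings, as here, alpha falls slightly below omega.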
