This article explores how topic modeling can support book index creation, comparing a classic machine learning method, Latent Dirichlet Allocation (LDA), with a large language model (LLM) approach, BERTopic, which builds on SBERT-derived sentence embeddings. Using the public-domain text ‘The cliff ruins of Canyon de Chelly, Arizona’ as a test corpus, the article covers document preprocessing, pipeline construction, and a suite of visualizations that help interpret the latent topics each model discovers. The results show that both techniques generate coherent, interpretable topics, but BERTopic yields richer, more nuanced topic clusters because it works on larger text spans and preserves grammatical structure. Consequently, the author recommends LLM-based topic modeling over traditional LDA for topic discovery in book indexing.
Donald Howes studied this question.
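The contrast between the two approaches' inputs can be pictured with a small sketch. LDA typically consumes bag-of-words counts, which discard word order, while an embedding-based method such as BERTopic is typically fed whole sentences or passages. The code below is a minimal standard-library illustration of that difference only, not the article's pipeline: the helper names `bag_of_words` and `sentence_spans`, the tiny stopword list, and the sample sentences (placeholders, not quoted from the book) are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list (an assumption; real pipelines use a fuller one).
STOPWORDS = {"the", "of", "in", "a", "an", "and", "are", "is", "to", "on"}


def bag_of_words(text: str) -> Counter:
    """LDA-style input: lowercase tokens, stopwords dropped, word order discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def sentence_spans(text: str) -> List[str]:
    """Embedding-style input: whole sentences kept intact, grammar preserved."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Placeholder sentences for illustration only (not taken from the test corpus).
sample = (
    "The ruins are built on ledges in the canyon walls. "
    "The canyon walls are sandstone."
)

print(bag_of_words(sample))    # word counts; order and grammar are gone
print(sentence_spans(sample))  # full sentences, ready for a sentence encoder
```

In this sketch, two texts containing the same words in a different order produce identical bag-of-words counts, whereas their sentence spans remain distinct. That is the sense in which a sentence-embedding approach retains structure that a count-based model never sees.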