What question did this study set out to answer?

This research compares the effectiveness of BERTopic and LDA in modeling topics from Korean sleep health discourse on social media.

March 13, 2026

A Comparative Analysis of BERTopic and LDA for Topic Modeling of Korean Sleep Health Discourse on Social Media

Key Points

Collected 8,002 Naver blog posts from March to October 2025 using 9 sleep-related keywords.
Applied both BERTopic and LDA to the same dataset for topic modeling.
Evaluated performance based on metrics like the number of topics, noise ratio, distribution entropy, and topic coherence.
BERTopic identified 9 topics with a noise ratio of 22.8%.
LDA identified 6 effective topics with a noise ratio of 0.9%.
BERTopic achieved higher distribution uniformity (0.852) than LDA (0.804).
LDA's coherence score (C_V) was 0.5287.
BERTopic's 'melatonin/hormone' topic showed an 84.1% alignment with LDA's corresponding topic.
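
The noise ratio and distribution uniformity figures above can be illustrated with a minimal sketch. The summary does not give the study's exact formulas, so this assumes that noise is the share of documents carrying an outlier label (BERTopic uses -1 for outliers) and that uniformity is the Shannon entropy of topic sizes normalized by its maximum. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

NOISE_LABEL = -1  # assumption: outlier label, as used by BERTopic


def noise_ratio(labels):
    """Fraction of documents assigned to the noise label."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for t in labels if t == NOISE_LABEL) / len(labels)


def distribution_uniformity(labels):
    """Normalized Shannon entropy of topic sizes (noise excluded), in [0, 1]."""
    counts = Counter(t for t in labels if t != NOISE_LABEL)
    if len(counts) < 2:
        return 0.0  # a single topic has no spread to measure
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))  # divide by the maximum entropy


# Toy topic assignments, one label per document (not data from the study).
labels = [-1, 0, 0, 0, 1, 1, 2, 2, 2, 2]
print(noise_ratio(labels))
print(distribution_uniformity(labels))
```

Under this definition, a value of 1.0 would mean every topic holds the same number of documents, so a higher score corresponds to a more balanced assignment.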

Abstract

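This study compares the performance of BERTopic and Latent Dirichlet Allocation (LDA) for topic modeling of Korean-language social media text on sleep health. A total of 8,002 blog posts were collected from Naver using 9 sleep-related keywords from March to October 2025. Both methods were applied to the same dataset, and performance was evaluated using metrics including the number of topics, noise ratio, distribution entropy, and topic coherence. BERTopic produced 9 topics with a noise ratio of 22.8%, whereas LDA produced 6 effective topics with a low noise ratio of 0.9%. BERTopic showed higher distribution uniformity (0.852) than LDA (0.804), indicating more balanced topic assignment. LDA's coherence score (C_V) was 0.5287. In the cross-analysis, BERTopic's 'melatonin/hormone' topic showed an 84.1% agreement rate with the corresponding LDA topic, indicating high consistency for well-defined themes. This study provides practical guidance for selecting a topic modeling method for analyzing Korean health-related text.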
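
One plausible reading of the cross-analysis agreement rate is the share of documents in a given BERTopic topic that fall into a single LDA topic. The sketch below works under that assumption. The abstract does not state how the 84.1% figure was computed, and the function name, topic ids, and toy labels are hypothetical.

```python
from collections import Counter


def alignment_rate(labels_a, labels_b, topic_a):
    """Return (best-matching topic in B, share of topic_a documents assigned to it)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same documents")
    matched = [b for a, b in zip(labels_a, labels_b) if a == topic_a]
    if not matched:
        raise ValueError("no documents carry the requested topic")
    best, count = Counter(matched).most_common(1)[0]
    return best, count / len(matched)


# Toy per-document labels from two models (not data from the study).
bertopic_labels = [3, 3, 3, 3, 1, 1, -1, 3]
lda_labels = [2, 2, 2, 0, 0, 0, 1, 2]

best, rate = alignment_rate(bertopic_labels, lda_labels, topic_a=3)
print(best, rate)
```

Here the LDA label for each document would typically be its highest-probability topic, which is itself an assumption about how the two models' outputs were compared.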


Cite This Study

JongHwi Song studied this question.

synapsesocial.com/papers/69b3ab0002a1e69014ccbb6d https://doi.org/10.9708/jksci.2026.31.02.231
