As the volume and diversity of data in modern society grow rapidly, effective analysis and understanding of that data have become important tasks. This paper clusters public data to identify problems with data quality and consistency and seeks ways to improve them. The data used were general physical education data provided by data.go.kr. Nouns were extracted from the text data, and clustering was performed on TF-IDF vectors. Performance was evaluated by comparing the K-means, DBSCAN, and GMM algorithms and the keyword extraction methods, and problems with data consistency and quality were analyzed. The study confirmed that stopword processing and the choice of keyword extraction method had a significant impact on the clustering results. Data length, format, and keyword quality also affected clustering performance. It was concluded that data imbalance, lack of consistency, and lack of standards can affect clustering results, and that standardized guidelines and research are needed to solve these problems. We identify the diversity of the data through clustering, use the findings to suggest improvements to data collection and analysis strategies, and emphasize the importance of improving data quality and actively using clustering techniques for the effective use of public data.
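The pipeline described in the abstract (noun tokens, stopword removal, TF-IDF vectorization, clustering) can be illustrated with a minimal, standard-library-only sketch. Everything concrete here is an assumption made for illustration: the toy English token lists stand in for nouns extracted from the data.go.kr records, the stopword set is hypothetical, the smoothed IDF formula is one common convention rather than the one the authors necessarily used, and only K-means is shown (DBSCAN and GMM are not reproduced).

```python
import math
from collections import Counter

# Hypothetical toy documents, already reduced to noun tokens.
# The paper's noun-extraction step on the real data is not reproduced here.
DOCS = [
    ["swimming", "pool", "lesson"],
    ["swimming", "pool", "schedule"],
    ["pool", "swimming", "coach"],
    ["soccer", "field", "league"],
    ["soccer", "league", "match"],
    ["field", "soccer", "coach"],
]
STOPWORDS = {"coach"}  # assumed stopword list, for illustration only


def tfidf(docs, stopwords=frozenset()):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    docs = [[t for t in d if t not in stopwords] for d in docs]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumed convention): log((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard empty docs
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [vectors[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        )
    labels = []
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centers[j])) for v in vectors
        ]
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                new_centers.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


vocab, X = tfidf(DOCS, STOPWORDS)
labels = kmeans(X, k=2)
print(labels)
```

On this toy input the swimming and soccer documents share no terms once the stopword is removed, so they fall into separate clusters. Rerunning with a different `STOPWORDS` set is a simple way to probe how stopword handling changes the vectors that the clustering step sees.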
Hong et al. (Wed,) studied this question.