What question did this study set out to answer?

To investigate the effectiveness of LLM-generated embeddings in preserving data structure and improving predictive modeling in clinical datasets.

March 12, 2026 · Open Access

Unveiling patterns in clinical data: exploring the role of large language models and clustering algorithms

Key Points

Applied dimensionality reduction techniques such as PCA and t-SNE.
Used k-means clustering to analyze original vs. LLM-derived datasets.
Evaluated model performance across 100 synthetic datasets and two real clinical datasets.
Assessed multiple LLM architectures focusing on predictive accuracy and efficiency.
BERT embeddings achieved a cosine similarity of 0.95 on linear datasets.
Llama 2 reached 0.85 on quadratic datasets but had higher computational costs.
Predictive performance improved as the subject variable ratio (SVR) increased, and three performance groups were identified.
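
The structure-preservation comparison in the key points above can be illustrated with a small sketch. It is an assumption for illustration, not the study's published pipeline: the function names (`pairwise_distances`, `structure_similarity`) are hypothetical, and scoring structure by the cosine similarity of flattened pairwise-distance vectors is only one plausible way to compare original data with its embeddings.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_distances(rows):
    """Euclidean distance for every unordered pair of rows, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(rows, 2)]


def structure_similarity(original_rows, embedded_rows):
    """Hypothetical structure score (an assumption, not the paper's metric):
    cosine similarity between the pairwise-distance vectors of the original
    data and of its embedding. Rows must describe the same subjects in the
    same order; the two spaces may have different dimensionality."""
    if len(original_rows) != len(embedded_rows):
        raise ValueError("both inputs must contain the same number of subjects")
    return cosine_similarity(
        pairwise_distances(original_rows),
        pairwise_distances(embedded_rows),
    )


# Toy data: the "embedding" is the original geometry scaled by 3 and placed
# in a 3-dimensional space, so relative distances are fully preserved.
original = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
embedded = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 6.0, 0.0]]
score = structure_similarity(original, embedded)
print(round(score, 6))
```

Because the score depends only on relative distances, it is insensitive to uniform scaling and to the dimensionality of the embedding space. That is convenient when comparing a handful of clinical variables against high-dimensional LLM embeddings.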

Abstract

Objective: Large Language Models (LLMs) have shown exceptional performance in natural language processing, yet their utility in structured clinical data analysis remains relatively underexplored. This pilot study investigates whether LLM-generated embeddings can preserve the structural integrity of clinical datasets and enhance predictive modeling, particularly in resource-constrained settings.

Methods: We applied dimensionality reduction techniques, namely Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE), together with k-means clustering, to compare original data structures with those derived from LLM embeddings. Evaluation metrics included cosine similarity, area under the curve (AUC), and R², applied across 100 synthetic datasets and two real-world clinical datasets: the UCI medical database and endocarditis patient records. We assessed multiple LLM architectures, including BERT, RoBERTa, Llama 2, and E5-small, focusing on predictive accuracy and computational efficiency.

Results: LLM embeddings closely mirrored original data structures, with BERT achieving a cosine similarity of 0.95 on linear datasets and Llama 2 (30B) reaching 0.85 on quadratic datasets, albeit at higher computational cost. Predictive performance improved significantly across the board as the subject variable ratio (SVR) increased. Three performance groups were identified: similar performance, assisted better, and assisted significantly better. These groups differed according to the equation used to generate the synthetic data.

Discussion: These findings highlight the potential of LLMs to enhance structured data analysis by identifying optimal conditions, such as SVR thresholds, for their practical use. The trade-off between computational cost and performance across different LLM architectures is also emphasized, suggesting the need for context-specific model selection.

Conclusion: LLMs can be effectively leveraged to repurpose existing clinical datasets for individualized clinical questions, such as optimizing surgical timing for patients with infective endocarditis and embolic stroke. This approach advances precision medicine and supports data-driven clinical decision-making.
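
The abstract refers to synthetic datasets generated from different equations and to the subject variable ratio (SVR) without giving formulas. The sketch below shows one way such data could be produced, under stated assumptions: SVR is taken to mean the number of subjects divided by the number of variables, and the generator `make_synthetic_dataset`, its coefficients, and its noise level are hypothetical choices, not values from the study.

```python
import random


def subject_variable_ratio(n_subjects, n_variables):
    """Assumed definition: number of subjects divided by number of variables."""
    if n_variables <= 0:
        raise ValueError("n_variables must be positive")
    return n_subjects / n_variables


def make_synthetic_dataset(n_subjects, n_variables, kind="linear",
                           noise_sd=0.1, seed=0):
    """Return (X, y), where y follows a linear or quadratic equation in X.

    Coefficients, predictor distribution, and noise are illustrative only.
    """
    if kind not in ("linear", "quadratic"):
        raise ValueError("kind must be 'linear' or 'quadratic'")
    rng = random.Random(seed)  # local generator keeps results reproducible
    coefs = [rng.uniform(-1.0, 1.0) for _ in range(n_variables)]
    X, y = [], []
    for _ in range(n_subjects):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_variables)]
        if kind == "linear":
            signal = sum(c * x for c, x in zip(coefs, row))
        else:
            signal = sum(c * x * x for c, x in zip(coefs, row))
        X.append(row)
        y.append(signal + rng.gauss(0.0, noise_sd))
    return X, y


X, y = make_synthetic_dataset(200, 10, kind="quadratic")
print(subject_variable_ratio(len(X), len(X[0])))  # 20.0 subjects per variable
```

Varying `n_subjects` while holding `n_variables` fixed gives a simple way to sweep SVR and observe how a downstream model's performance changes with it.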
