What question did this study set out to answer?

This research aims to explore how different data representations can enhance predictive accuracy in healthcare and social media contexts.

June 15, 2026 · Open Access

Data representations in high-dimensional and complex data: predictive modeling in healthcare and social media

Key Points

Analyzed Electronic Health Record data for cancer prediction using dimensionality reduction techniques.
Applied a topic-based graph framework to forecast post-event social media sentiment.
Developed MELT, which combines topic features and transformer embeddings with time-weighted weak labels, and RISE, which adds recency-aware inputs and confidence-weighted weak supervision.
Achieved 88% accuracy in predicting cancer diagnoses using compact patient representations.
Predicted emotional responses to mass shootings with 69% accuracy through topic-based approaches.
MELT reported the strongest macro F1 among prior graph-based methods; RISE, a graph-free model, then outperformed earlier graph-based methods by 7 to 11 macro F1 points.
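
Because the headline comparison is stated in macro F1 points, a small worked example may help show what that metric measures. Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a common one; one "point" is conventionally 0.01 on the 0 to 1 scale. The function name `macro_f1` and the toy sentiment labels below are illustrative assumptions, not data or code from the dissertation.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never matches
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: a model that always predicts the majority class.
y_true = ["neg", "neg", "neg", "pos"]
y_pred = ["neg", "neg", "neg", "neg"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.75 while macro F1 is 3/7 (about 0.43), because the missed minority class contributes an F1 of zero. That gap is why macro F1 is often preferred over accuracy when sentiment classes are imbalanced.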

Abstract

The performance of machine learning models depends strongly on how data are represented, especially in domains where inputs are high-dimensional, sparse, sequential, or relational. This dissertation studies how representation learning can improve predictive modeling in two application areas: healthcare and social media. The first study examines sequential Electronic Health Record (EHR) data for predicting lung, breast, cervical, and liver cancers. Because diagnosis histories are high-dimensional and sparse, dimensionality reduction techniques were used to create efficient patient representations for recurrent models, and the best-performing model achieved 88% accuracy, demonstrating the value of compact representations for early disease prediction. The remaining studies address the more difficult task of forecasting post-event social media sentiment using only pre-event activity. A topic-based graph framework achieved 69% accuracy in predicting users’ emotional responses to mass shootings, while PRESTIGE improved this line of work by incorporating sentence-level semantic embeddings into graph-based user modeling. The final two studies further extend this direction. MELT combines topic features and transformer embeddings with time-weighted weak labels, achieving the strongest macro F1 among prior graph-based methods. RISE then introduces recency-aware inputs and confidence-weighted weak supervision, and shows that a graph-free model can outperform earlier graph-based methods by 7 to 11 macro F1 points. Together, these results show that carefully designed representations and supervision strategies can substantially improve predictive performance across diverse domains.
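
The abstract does not say how the time-weighted weak labels are computed. As a rough illustration of the general idea only, the sketch below aggregates noisy per-post sentiment scores into one user-level weak label, giving posts closer to the event more weight through exponential decay. The function name `time_weighted_label`, the half-life parameter, and the decay form are all assumptions made for this example, not the method used in MELT or RISE.

```python
def time_weighted_label(posts, half_life_days=30.0):
    """Aggregate weak per-post scores into one weak label.

    posts: list of (days_from_event, weak_score) pairs, with weak_score
           in [-1, 1] (hypothetical output of a noisy sentiment labeler).
    A post's weight halves for every `half_life_days` of distance from
    the event; this decay scheme is an assumption of this sketch.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for days_from_event, weak_score in posts:
        weight = 0.5 ** (abs(days_from_event) / half_life_days)
        weighted_sum += weight * weak_score
        total_weight += weight
    # With no posts there is no evidence, so fall back to a neutral score.
    return weighted_sum / total_weight if total_weight else 0.0


# Hypothetical user: a positive post at the event, a negative one 30 days away.
label = time_weighted_label([(0, 1.0), (30, -1.0)], half_life_days=30.0)
```

Here the two posts receive weights 1 and 0.5, so the aggregated label is (1 - 0.5) / 1.5 = 1/3: the nearer post dominates, but the older one still pulls the label toward neutral.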
