The performance of machine learning models depends strongly on how data are represented, especially in domains where inputs are high-dimensional, sparse, sequential, or relational. This dissertation studies how representation learning can improve predictive modeling in two application areas: healthcare and social media. The first study examines sequential Electronic Health Record (EHR) data for predicting lung, breast, cervical, and liver cancers. Because diagnosis histories are high-dimensional and sparse, dimensionality reduction techniques were used to create efficient patient representations for recurrent models. The best-performing model achieved 88% accuracy, demonstrating the value of compact representations for early disease prediction. The remaining studies address the more difficult task of forecasting post-event social media sentiment using only pre-event activity. A topic-based graph framework achieved 69% accuracy in predicting users’ emotional responses to mass shootings, while PRESTIGE improved this line of work by incorporating sentence-level semantic embeddings into graph-based user modeling. The final two studies further extend this direction. MELT combines topic features and transformer embeddings with time-weighted weak labels, achieving the strongest macro F1 among prior graph-based methods. RISE then introduces recency-aware inputs and confidence-weighted weak supervision, and shows that a graph-free model can outperform earlier graph-based methods by 7 to 11 macro F1 points. Together, these results show that carefully designed representations and supervision strategies can substantially improve predictive performance across diverse domains.
Jovan Andjelkovic (Thu,) studied this question.