The performance of machine learning models depends strongly on how data are represented, especially in domains where inputs are high-dimensional, sparse, sequential, or relational. This dissertation studies how representation learning can improve predictive modeling in two application areas: healthcare and social media. The first study examines sequential Electronic Health Record (EHR) data for predicting lung, breast, cervical, and liver cancers. Because diagnosis histories are high-dimensional and sparse, dimensionality reduction techniques were used to create efficient patient representations for recurrent models. The best-performing model achieved 88% accuracy, demonstrating the value of compact representations for early disease prediction. The remaining studies address the more difficult task of forecasting post-event social media sentiment using only pre-event activity. A topic-based graph framework achieved 69% accuracy in predicting users’ emotional responses to mass shootings, and PRESTIGE improved on this approach by incorporating sentence-level semantic embeddings into graph-based user modeling. The final two studies extend this direction further. MELT combines topic features and transformer embeddings with time-weighted weak labels, achieving the strongest macro F1 among the graph-based methods. RISE then introduces recency-aware inputs and confidence-weighted weak supervision, and shows that a graph-free model can outperform the earlier graph-based methods by 7 to 11 macro F1 points. Together, these results show that carefully designed representations and supervision strategies can substantially improve predictive performance across diverse domains.
Jovan Andjelkovic studied this question.