News and reports frequently drive future trends, yet traditional time series forecasting often fails to capture these external influences. To integrate textual insights, we introduce Text-Time Cross-Modal Attention (TTCA), a multimodal framework that fuses numerical time series embeddings with text embeddings extracted from a pre-trained language model. TTCA employs a cross-attention mechanism that treats time series features as queries and textual features as keys and values, so that semantic context enhances, rather than overshadows, the underlying temporal dynamics. Extensive evaluations on the Time-MMD dataset across nine real-world domains show that TTCA consistently outperforms state-of-the-art unimodal baselines, with average improvements of 3.29% in MSE and 9.66% in MAE. TTCA also shows moderate gains over recent multimodal approaches, particularly in event-driven scenarios.
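To make the query/key/value roles concrete, the following is a minimal NumPy sketch of single-head cross-attention in which time series features form the queries and text features form the keys and values. It is an illustration under stated assumptions, not the TTCA implementation: the function name `cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the dimensions, the random inputs, and in particular the residual connection used to keep the temporal signal primary are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(ts_feats, text_feats, w_q, w_k, w_v):
    """Single-head cross-attention: time series queries attend over text.

    ts_feats:   (T, d_ts)   time series features, one row per time step
    text_feats: (N, d_text) text features, one row per text token/document
    w_q: (d_ts, d_ts), w_k: (d_text, d_ts), w_v: (d_text, d_ts)
    """
    q = ts_feats @ w_q        # queries from the time series, (T, d_ts)
    k = text_feats @ w_k      # keys from the text,           (N, d_ts)
    v = text_feats @ w_v      # values from the text,         (N, d_ts)

    # Scaled dot-product scores: one row of text weights per time step.
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N)
    weights = softmax(scores, axis=-1)        # each row sums to 1

    context = weights @ v                     # text context per time step, (T, d_ts)

    # Assumption: a residual connection adds the text context on top of the
    # original temporal features instead of replacing them.
    fused = ts_feats + context
    return fused, weights


# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
T, N, d_ts, d_text = 8, 5, 16, 32
ts_feats = rng.normal(size=(T, d_ts))
text_feats = rng.normal(size=(N, d_text))
w_q = rng.normal(size=(d_ts, d_ts)) * 0.1
w_k = rng.normal(size=(d_text, d_ts)) * 0.1
w_v = rng.normal(size=(d_text, d_ts)) * 0.1

fused, weights = cross_attention(ts_feats, text_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Because the queries come from the time series, the output keeps one row per time step, and the text only contributes through the attention-weighted context. In this sketch the residual sum is one simple way to express the idea that text augments the temporal representation; a real model would typically learn the projections and may add normalization or gating.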
Anh et al. (Mon,) studied this question.