NeuroFusion-X is a unified, modular framework for end-to-end prediction from heterogeneous real-world data. Many decisions require joint reasoning over time-series, images, and text, yet production systems remain siloed, and naïve early/late fusion misses cross-modal dependencies and temporal alignment. NeuroFusion-X addresses this via: (1) modality-specialized encoders (CNNs for images, a compact transformer for text, and a bidirectional time-series encoder with temporal attention); (2) a cross-modal fusion-attention block that learns instance-wise interactions and down-weights noisy or missing channels; and (3) parameter-efficient bottlenecks and inference-oriented kernels to cut latency without sacrificing accuracy. To evaluate under realistic conditions and at scale while avoiding privacy constraints, we construct a controlled synthetic benchmark of 500k multimodal samples across healthcare, finance, and cybersecurity. Each sample includes a 48-step, 30-variable time-series, a 128×128 image, and a 60–160-token note, with class imbalance, inference-time modality masks, and induced distribution shifts. Across 18 tasks, NeuroFusion-X reaches approximately 97.8% mean accuracy and approximately 0.976 macro-F1 while reducing median per-sample inference latency by approximately 35% versus a strong baseline. Robustness holds: macro-F1 drops by ≤1.6% under 20% modality dropout and by ≤2.2% under light adversarial perturbations. Ablations show that fusion-attention, modality dropout, and domain-adaptive normalization drive reliability. We outline deployment pathways for safety-critical contexts and integration with multimodal LLMs for rationale-grounded predictions.
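To illustrate how a fusion step can down-weight or ignore missing modalities, the following is a minimal NumPy sketch of masked attention pooling over per-modality embeddings. It is not the NeuroFusion-X fusion block: the function name `fusion_attention`, the single scoring vector `query`, the embedding size, and the random inputs are all assumptions made for illustration, and the sketch assumes every sample has at least one modality present.

```python
import numpy as np

def fusion_attention(embeddings, mask, query):
    """Masked attention pooling over modality embeddings (illustrative only).

    embeddings: (batch, n_modalities, dim) outputs of the per-modality encoders
    mask:       (batch, n_modalities), 1.0 if the modality is present, else 0.0
    query:      (dim,) scoring vector; learned in a real model, fixed here
    Assumes each sample has at least one modality present.
    """
    dim = embeddings.shape[-1]
    # One scalar relevance score per modality and per sample.
    scores = embeddings @ query / np.sqrt(dim)
    # Missing modalities get -inf so their softmax weight is exactly zero.
    scores = np.where(mask > 0, scores, -np.inf)
    # Numerically stable softmax across the modality axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Instance-wise weighted sum of the modality embeddings.
    fused = np.einsum("bm,bmd->bd", weights, embeddings)
    return fused, weights

# Hypothetical inputs: 2 samples, 3 modalities (time-series, image, text), dim 8.
rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 3, 8))
mask = np.array([[1.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0]])  # the second sample has no image
fused, w = fusion_attention(emb, mask, rng.normal(size=8))
```

Because the weights depend on each sample's own embeddings and mask, the remaining modalities are renormalized per instance when one is absent, which is one simple way to realize inference-time modality masks.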
Ashutosh Agarwal (Sat,) studied this question.