Digital pathology generates large whole slide images (WSIs): individual files often exceed several gigabytes, and research cohorts scale to tens or hundreds of terabytes. These data volumes, combined with compute-intensive deep learning workflows, require high-performance computing (HPC) infrastructure for efficient model training and experimentation. We use the VSC-5 and MUSICA clusters to advance generative modeling in computational pathology, demonstrating scalable distributed training of diffusion models on datasets with varied tissue morphologies. Our work is funded by the European Union through the projects RI-SCALE and OSCARS (Grant Agreement Numbers 10188168 and 101129751). Our application focuses on diffusion-based image synthesis of histopathology patches, which enables data augmentation for tasks such as tumor classification and survival prediction. Strict privacy requirements are met through model sanitization via Differential Privacy (DP) and evaluation against membership inference attacks. Such augmentation is essential because real pathology data is limited by class imbalance, scanner variability, and privacy constraints. The generated patches realistically mimic tissue textures and colors. The models, with hundreds of millions of parameters, require substantial GPU memory and parallelization to train stably with large effective batch sizes. We train on 256×256 patches extracted from WSIs, processing millions of samples. To match application needs to infrastructure capabilities, we optimized our workflows for MUSICA's GPU partition, enabling memory-efficient patch loading and overlap of communication with computation during training. Each node of the MUSICA GPU partition (zen4_0768_h100x4) has 2× AMD EPYC 9654 CPUs (192 cores total), 768 GB DDR5 RAM, 4× NVIDIA H100 SXM5 GPUs (94 GB memory per GPU), 7.68 TB of local NVMe storage, and 4× NDR200 InfiniBand links for efficient multi-node communication.
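To make the node figures above concrete, the following minimal sketch computes the host resources available to each training process and the resulting effective batch size. It assumes one training process per GPU and an even split of cores and RAM; the names `NodeSpec`, `per_gpu_share`, and `effective_batch_size`, as well as the per-GPU batch size and gradient accumulation steps in the usage note, are illustrative assumptions and not taken from our training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node resources of the MUSICA GPU partition, as quoted in the text."""
    cpu_cores: int = 192   # 2x AMD EPYC 9654
    ram_gb: int = 768      # DDR5
    gpus: int = 4          # NVIDIA H100 SXM5
    gpu_mem_gb: int = 94   # memory per GPU


def per_gpu_share(node: NodeSpec) -> dict:
    """Host resources per training process, assuming one process per GPU
    and an even split of cores and RAM (an assumption, not a scheduler rule)."""
    return {
        "cpu_cores": node.cpu_cores // node.gpus,
        "ram_gb": node.ram_gb // node.gpus,
    }


def effective_batch_size(per_gpu_batch: int, num_gpus: int,
                         grad_accum_steps: int = 1) -> int:
    """Samples contributing to one optimizer step under data parallelism."""
    if min(per_gpu_batch, num_gpus, grad_accum_steps) < 1:
        raise ValueError("all arguments must be positive integers")
    return per_gpu_batch * num_gpus * grad_accum_steps
```

Under these assumptions, `per_gpu_share(NodeSpec())` gives 48 CPU cores and 192 GB of RAM per process, which bounds the number of data-loading workers and the amount of patch data that can be cached per GPU. With a hypothetical per-GPU batch of 32 and 2 accumulation steps, `effective_batch_size(32, 16, 2)` yields 1024 samples per optimizer step on 16 GPUs.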
Our distributed training implementation on MUSICA uses PyTorch's DistributedDataParallel (DDP) with the NCCL communication backend, launching exactly one training process per GPU. This setup scales from a single node using all 4 GPUs to multi-node runs spanning 4 nodes with 16 H100 GPUs in total. Scaling experiments on representative diffusion model training workloads for large digital pathology datasets achieve approximately 75% parallel efficiency when scaling from 1 to 16 GPUs, limited primarily by all-reduce communication overhead and batch size scaling effects. These results underscore MUSICA's effectiveness for data-intensive AI workloads in biomedical research.
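For reference, the parallel efficiency quoted above can be read as measured speedup divided by the number of GPUs. The sketch below uses the common throughput-based definition; whether our experiments used exactly this definition is an assumption, and the throughput values in the usage note are hypothetical numbers chosen only to reproduce a 75% figure, not measurements.

```python
def speedup(throughput_1: float, throughput_n: float) -> float:
    """Speedup of an N-GPU run relative to the single-GPU baseline,
    based on training throughput (e.g. samples per second)."""
    if throughput_1 <= 0 or throughput_n <= 0:
        raise ValueError("throughputs must be positive")
    return throughput_n / throughput_1


def parallel_efficiency(throughput_1: float, throughput_n: float,
                        num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved on num_gpus GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return speedup(throughput_1, throughput_n) / num_gpus
```

With a hypothetical baseline of 100 samples/s on 1 GPU and 1200 samples/s on 16 GPUs, `parallel_efficiency(100.0, 1200.0, 16)` returns 0.75. In other words, 75% efficiency on 16 GPUs corresponds to a speedup of about 12× over a single GPU.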
Saurugger et al. studied this question.