Volume electron microscopy (vEM) provides nanometer-scale 3D imaging, yet its axial (z) resolution is often much lower than its in-plane (xy) resolution, yielding anisotropic volumes that hinder segmentation and connectomic reconstruction. We present a two-stage cross-axial super-resolution framework for isotropic reconstruction that combines domain-specific self-supervised pretraining of a vision transformer (ViT) with a conditional diffusion model. In the first stage, the student–teacher self-distillation paradigm of DINOv3 is adopted to learn representations from large sets of high-resolution xy sections, capturing vEM-specific texture statistics and ultrastructural patterns. In the second stage, a conditional diffusion denoiser is trained in a supervised manner on anisotropic degradations simulated by z-downsampling, while a perceptual loss based on feature distances from the frozen ViT constrains generated slices to match the distribution of real sections. Together, the supervision and the perceptual constraint recover axial high-frequency details and reduce hallucinated textures and inter-slice drift, improving cross-slice consistency. Experiments on two public vEM datasets show improved fidelity, perceptual quality, and membrane-boundary continuity over interpolation and learning-based baselines.
Qiu et al. (Thu,) studied this question.
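The two ingredients of the second stage, simulated z-downsampling and a feature-distance loss, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the downsampling kernel or the distance metric, so averaging adjacent sections and a mean squared distance are assumptions here, and `simulate_anisotropy`, `feature_distance`, and `toy_features` are hypothetical names. In particular, `toy_features` is only a stand-in for the frozen, pretrained ViT encoder.

```python
import numpy as np


def simulate_anisotropy(volume: np.ndarray, factor: int) -> np.ndarray:
    """Degrade a (z, y, x) volume along z by averaging groups of `factor` sections.

    Assumption: averaging is one possible degradation model; strided
    subsampling (volume[::factor]) is another common choice.
    """
    z, y, x = volume.shape
    usable = (z // factor) * factor  # drop trailing sections that do not fill a group
    grouped = volume[:usable].reshape(usable // factor, factor, y, x)
    return grouped.mean(axis=1)


def toy_features(section: np.ndarray, patch: int = 4) -> np.ndarray:
    """Hypothetical stand-in for a frozen ViT: mean intensity of each patch."""
    h, w = section.shape
    h, w = (h // patch) * patch, (w // patch) * patch  # crop to whole patches
    blocks = section[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return blocks.mean(axis=(1, 3))


def feature_distance(feat_generated: np.ndarray, feat_real: np.ndarray) -> float:
    """Mean squared distance between two feature arrays of equal shape (assumed metric)."""
    return float(np.mean((feat_generated - feat_real) ** 2))


rng = np.random.default_rng(0)
volume = rng.random((8, 16, 16))               # synthetic isotropic volume (z, y, x)
low_z = simulate_anisotropy(volume, factor=4)  # 4x coarser along z

# Compare features of a degraded section with those of a real xy section.
loss = feature_distance(toy_features(low_z[0]), toy_features(volume[0]))
```

In a training loop, the degraded volume would serve as the conditioning input, the original volume as the target, and the feature distance would be added to the denoising objective with a weighting term that the abstract does not state.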