What question did this study set out to answer?

The study aims to develop a method for reconstructing high-fidelity 3D models of cultural relics from a single 2D image.

March 13, 2026 · Open Access

From 2D to 3D: A Generative Model from Single Image to Digital 3D of Chinese Three Gorges Cultural Relics

Key Points

The study aims to develop a method for reconstructing high-fidelity 3D models of cultural relics from a single 2D image.
Developed a generative framework for 3D reconstruction.
Utilized a transformer-based image-to-triplane architecture.
Employed a vision transformer encoder for feature extraction.
Implemented a neural radiance field for geometry synthesis.
Evaluated on a dataset of Chinese Three Gorges cultural relics.
Achieved structurally coherent and visually consistent 3D reconstructions.
Demonstrated superior accuracy and generalization compared to existing methods.
Preserved morphological characteristics under limited data conditions.
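
As a rough illustration of the triplane representation named in the key points, the sketch below looks up a feature vector for a 3D point by projecting it onto three axis-aligned feature planes and bilinearly interpolating each one. Everything here is an assumption made for illustration: the plane names (`xy`, `xz`, `yz`), the nested-list layout, the `[-1, 1]` coordinate range, and the choice to sum the three features (concatenation is another common choice). It is not the authors' implementation.

```python
import math

def bilinear(plane, u, v):
    """Bilinearly sample plane[row][col] (a list of channels) at u, v in [-1, 1]."""
    rows, cols = len(plane), len(plane[0])
    # Map [-1, 1] to continuous grid coordinates and clamp to the grid.
    x = min(max((u + 1.0) / 2.0 * (cols - 1), 0.0), cols - 1)
    y = min(max((v + 1.0) / 2.0 * (rows - 1), 0.0), rows - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1 - tx) + plane[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out

def query_triplane(planes, point):
    """Project a 3D point onto the three planes and sum the sampled features."""
    x, y, z = point
    f_xy = bilinear(planes["xy"], x, y)
    f_xz = bilinear(planes["xz"], x, z)
    f_yz = bilinear(planes["yz"], y, z)
    # Summation is an assumption; some models concatenate instead.
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]

# Toy 2x2 planes with a single feature channel (hypothetical values).
toy = [[[0.0], [1.0]], [[2.0], [3.0]]]
planes = {"xy": toy, "xz": toy, "yz": toy}
center_feature = query_triplane(planes, (0.0, 0.0, 0.0))
```

In a full pipeline, a small decoder network would map such a feature vector to density and color at the queried point; that step is omitted here.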

Abstract

Acquiring high-quality three-dimensional (3D) models of cultural relics typically relies on expensive scanning equipment or multi-view image capture, which limits large-scale deployment in real-world heritage conservation. Large-scale water impoundment in the Three Gorges region has permanently submerged numerous cultural relics and archaeological remains. For many of these artifacts, a single two-dimensional (2D) image is the only surviving visual record, which makes traditional multi-view reconstruction and physical scanning infeasible and poses significant challenges for recovering the original 3D geometry and appearance. To address this challenge, we propose a generative framework for reconstructing high-fidelity 3D digital models of Chinese Three Gorges cultural relics from a single 2D image. Building upon recent advances in generative 3D representation learning, the proposed method adopts a transformer-based image-to-triplane architecture to infer an implicit 3D representation directly from a single RGB image. A vision transformer encoder extracts global and local visual features, which a cross-attention-based decoder then projects into a compact triplane representation. The reconstructed triplane features are further decoded by a neural radiance field (NeRF) to synthesize dense geometry and appearance, enabling accurate mesh extraction and novel-view rendering. To enhance robustness under in-the-wild conditions, the model implicitly estimates camera parameters during inference without relying on explicit calibration information. The proposed method is evaluated on a dataset of Chinese Three Gorges cultural relics covering diverse artifact categories and visual styles. Experimental results demonstrate that the framework produces structurally coherent and visually consistent 3D reconstructions from a single image, effectively preserving key morphological characteristics of cultural relics under limited data conditions. Compared with existing single-image and multi-view reconstruction baselines, the proposed framework exhibits better reconstruction accuracy, visual consistency, and generalization capability. This study provides an efficient, scalable, and practical pathway for the large-scale 3D digitization of heritage artifacts, including submerged ones, from archival images, and contributes to the application of generative 3D modeling techniques in cultural heritage preservation and restoration.
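
The NeRF decoding step mentioned in the abstract ultimately turns per-sample densities and colors along a camera ray into one pixel color. The following is a minimal sketch of the standard volume-rendering quadrature used by NeRF-style methods, not the paper's code; the function name `composite_ray` and the toy inputs are hypothetical.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, ordered front to back.

    densities: non-negative density (sigma) per sample
    colors:    [r, g, b] per sample
    deltas:    distance between consecutive samples
    Returns the accumulated RGB color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        rgb = [acc + w * ch for acc, ch in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, weights

# Toy ray: empty space first, then an effectively opaque blue surface.
pixel, ray_weights = composite_ray(
    densities=[0.0, 1e9],
    colors=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    deltas=[0.1, 0.1],
)
```

Because the weights peak where the ray first meets dense material, the same weights can also be used to estimate depth along the ray, which is one common route to mesh extraction.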
