The degradation of physical student examination archives, particularly handwritten essay booklets, presents a significant barrier to longitudinal academic research, institutional record preservation, and student performance analysis. This study introduces a natural language processing (NLP)-based framework for the automated reconstruction of damaged academic essay manuscripts using a span-infilling transformer architecture. A synthetic dataset of 5,000 paired samples, each consisting of a damaged text and its complete counterpart, was curated from archived Data Science examination scripts collected at the Center for Applied Data Science, Sol Plaatje University, South Africa. The proposed method fine-tunes a T5-based encoder–decoder model, using span corruption and task-specific prompting to restore missing or illegible segments. Evaluation with ROUGE-L, BLEU-4, and BERTScore shows substantial improvements over baseline models, including BERT and GPT-2. Qualitative assessments by academic experts further support the fluency, coherence, and contextual relevance of the restored texts. Training dynamics show stable convergence without overfitting, and ablation studies confirm the contribution of each architectural component. Token-level error analyses and confidence-scored predictions provide additional interpretability. The framework offers a scalable solution for educational institutions seeking to digitize and recover lost historical student essay records, with potential extensions to other domains such as digital humanities and archival restoration.
Olaniyan et al. studied this question.
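The span-corruption idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `corrupt_spans` and `fill_spans`, the `"restore: "` prompt prefix, and the span-placement rule are all assumptions made for illustration, and the sketch works on whitespace-separated words rather than subword tokens. Only the `<extra_id_N>` sentinel format follows the usual T5 convention, in which each masked span is replaced by a sentinel in the input and the target lists each sentinel followed by the text it hides.

```python
import random
import re


def corrupt_spans(text, n_spans=2, span_len=3, seed=0, prefix="restore: "):
    """Build one (damaged_input, target) pair in T5 sentinel format.

    Hypothetical helper: `prefix` stands in for a task-specific prompt.
    """
    words = text.split()
    chunk = len(words) // n_spans
    if chunk <= span_len:
        raise ValueError("text too short for the requested spans")

    rng = random.Random(seed)
    # One span per chunk, leaving at least one kept word between spans.
    starts = [i * chunk + rng.randrange(chunk - span_len) for i in range(n_spans)]

    damaged, target, pos = [], [], 0
    for i, start in enumerate(starts):
        sentinel = f"<extra_id_{i}>"
        damaged.extend(words[pos:start])   # keep the undamaged words
        damaged.append(sentinel)           # mark the "illegible" span
        target.append(sentinel)
        target.extend(words[start:start + span_len])
        pos = start + span_len
    damaged.extend(words[pos:])
    target.append(f"<extra_id_{n_spans}>")  # closing sentinel
    return prefix + " ".join(damaged), " ".join(target)


def fill_spans(damaged, target, prefix="restore: "):
    """Merge a target (or model prediction) back into the damaged input."""
    parts = re.split(r"(<extra_id_\d+>)", target)
    # parts alternates: '', sentinel, span text, sentinel, span text, ...
    fills = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
    body = damaged[len(prefix):] if damaged.startswith(prefix) else damaged
    return " ".join(fills.get(tok, tok) for tok in body.split())


if __name__ == "__main__":
    essay = ("the model restores missing segments of damaged handwritten "
             "essay booklets from archived exam scripts")
    damaged, target = corrupt_spans(essay)
    print(damaged)
    print(target)
    print(fill_spans(damaged, target) == essay)
```

In a fine-tuning setup of the kind the abstract describes, pairs like `(damaged, target)` would serve as encoder input and decoder label, and `fill_spans` would be applied to the model's predicted target to obtain the restored essay text.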