What does this research mean for the field?

A hybrid automated scoring engine combining BERT semantic similarity and TF-IDF cosine similarity substantially outperforms TF-IDF-only baselines in evaluating paraphrased handwritten examination answers. Novelty: ClaimNovelty.METHODOLOGICAL. Consensus alignment: ConsensusAlignment.NEUTRAL.

What question did this study set out to answer?

This research aims to automate the evaluation of handwritten examination scripts to reduce subjectivity and improve scalability.

May 18, 2026Open Access

ExamAI: An Automated Examination Script Evaluation System Using Optical Character Recognition and Transformer-Based Natural Language Processing

Key Points

This research aims to automate the evaluation of handwritten examination scripts to reduce subjectivity and improve scalability.
Developed ExamAI as a full-stack web application with OCR and NLP functionalities.
Utilized a BERT semantic similarity model combined with a TF-IDF cosine similarity component for scoring.
Conducted empirical scoring analysis on 60 descriptive answer pairs.
The dual-model approach improved scoring accuracy for paraphrased correct answers from 10–35% to 37–62%.
Empirical analysis shows significant performance gains over TF-IDF only baselines.

Abstract

The evaluation of handwritten examination scripts is a labour-intensive process that introduces inter-evaluator subjectivity and limits institutional scalability as student enrollment grows. This paper presents ExamAI, a full-stack web application that automates the complete examination lifecycle encompassing scheduling, handwritten answer image submission, optical character recognition (OCR) text extraction, natural language processing (NLP) scoring, and result publication. The scoring engine combines a BERT semantic similarity model weighted at sixty percent with a TF-IDF cosine similarity component weighted at forty percent. The system is implemented as three independently deployable microservices: a React.js frontend, a Node.js Express backend, and a Python Flask machine learning service backed by MongoDB. Empirical scoring analysis across 60 descriptive answer pairs demonstrates that the dual-model approach substantially outperforms TF-IDF-only baselines for paraphrased correct answers, raising combined scores from 10–35% to 37–62%.

ExamAI: An Automated Examination Script Evaluation System Using Optical Character Recognition and Transformer-Based Natural Language Processing

Key Points

Abstract

Cite This Study