
May 17, 2026 | Open Access

A hybrid plagiarism detection framework using lexical and semantic similarity with lightweight sentence transformers

Key Points

The study develops a hybrid plagiarism detection framework that combines lexical and semantic similarity techniques.
The framework integrates TF-IDF-based cosine similarity with semantic similarity computed from sentence transformer embeddings.
A MiniLM-based sentence transformer is fine-tuned on the PAN 2011 plagiarism detection corpus.
The framework is validated through threshold-based analysis and on real-world web content retrieved by automated scraping.
The hybrid approach significantly improves detection accuracy compared to purely lexical methods.
The improvement is most pronounced for paraphrased plagiarism cases.
The system balances computational efficiency with semantic understanding, making it suitable for academic and forensic applications.

Abstract

Plagiarism detection has become increasingly challenging due to the widespread availability of paraphrasing tools and generative artificial intelligence systems. Traditional plagiarism detection techniques based on lexical similarity, such as TF-IDF and n-gram matching, often fail to identify semantically similar but lexically modified text. This paper presents a hybrid plagiarism detection framework that combines lexical similarity measures with semantic similarity derived from sentence transformer models. The proposed approach integrates TF-IDF-based cosine similarity with lightweight sentence embeddings generated using MiniLM and SBERT models. To enhance semantic detection performance, a MiniLM-based sentence transformer is fine-tuned on the PAN 2011 plagiarism detection corpus. Experimental evaluation demonstrates that the hybrid similarity approach significantly improves detection accuracy compared to purely lexical methods, particularly for paraphrased plagiarism cases. The framework is further validated using threshold-based analysis and real-world Web content retrieved through automated scraping. The proposed system provides an efficient and scalable solution for plagiarism detection, balancing computational efficiency with semantic understanding, and is suitable for academic and real-world forensic applications.
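To make the idea of a hybrid score concrete, the following is a minimal sketch of how a lexical TF-IDF cosine score and a semantic embedding cosine score might be blended and compared against a threshold. It is not the paper's implementation. The linear weighting, the `alpha` weight, the `threshold` value, the whitespace tokenizer, and all function names are assumptions made for illustration, and the three-dimensional vectors are toy stand-ins for the sentence embeddings that a MiniLM or SBERT model would produce.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so terms shared by every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine_sparse(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def cosine_dense(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_score(lexical, semantic, alpha=0.5):
    """Weighted blend of the two scores.

    ASSUMPTION: a linear combination with lexical weight alpha; the source
    does not state how the two similarities are combined.
    """
    return alpha * lexical + (1.0 - alpha) * semantic


def is_flagged(score, threshold=0.5):
    """ASSUMPTION: the threshold is illustrative, not a value from the study."""
    return score >= threshold


source = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast brown fox leaps above a sleepy dog"

src_vec, para_vec = tfidf_vectors([source, paraphrase])
lexical = cosine_sparse(src_vec, para_vec)

# Toy stand-ins for sentence embeddings. In practice these would come from
# a sentence transformer model rather than being written by hand.
src_emb = [0.9, 0.1, 0.4]
para_emb = [0.8, 0.2, 0.5]
semantic = cosine_dense(src_emb, para_emb)

score = hybrid_score(lexical, semantic, alpha=0.5)
print(f"lexical={lexical:.3f} semantic={semantic:.3f} "
      f"hybrid={score:.3f} flagged={is_flagged(score)}")
```

In this toy pair the paraphrase shares few surface tokens with the source, so the lexical score alone stays below the illustrative threshold, while the blended score crosses it. This is the kind of case the abstract describes, where purely lexical matching misses reworded text. Sweeping `threshold` over a labeled set of pairs would be one way to carry out a threshold-based analysis.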
