November 8, 2025

Efficient Classification of Human-Generated vs. Machine-Generated Text Using Lightweight Machine Learning Models

Key Points

Random Forest achieved the best performance, with accuracy up to 0.98, underscoring the value of engineered features.
SHAP-based explainability identified readability metrics and punctuation usage among the major drivers of classification.
The models were evaluated on features extracted from diverse sources, including Named Entity Recognition and lexical richness measures.
These findings suggest lightweight models are reliable for verifying content authenticity in decision-support systems.

Abstract

This study presents an efficient and interpretable approach to distinguishing human-authored text from machine-generated content using traditional machine learning techniques, thereby avoiding the computational demands of transformer-based classifiers. Two datasets were employed to ensure generalizability: (1) ROCStories narratives paired with continuations generated by FALCON-7B under three creativity settings, and (2) short news articles from The Indian Times and The Guardian continued by LLaMA-7B under identical settings. Preprocessing involved Minimal Text Cleaning (MTC) and Advanced Text Normalization (ATN), followed by feature extraction from TF-IDF, Part-of-Speech distributions, Named Entity Recognition, readability indices, lexical richness, n-gram frequencies, sentiment polarity, punctuation usage, and syntactic complexity. Random Forest consistently achieved top performance (accuracy up to 0.98, AUC/ROC up to 0.99), outperforming the Naïve Bayes baseline. To enhance transparency, SHAP-based explainability was applied, revealing that readability metrics, lexical richness, unigrams, and linguistic structures (POS and NER) were the strongest drivers of classification across both datasets. For comparison, GPT-4o and GPT-3.5-Turbo, tested in zero-shot mode, achieved a maximum accuracy of 0.68. These results highlight not only the robustness and computational efficiency of feature-engineered models but also their interpretability, suggesting their value as lightweight, transparent, and reliable components in decision-support systems where content authenticity verification is critical.
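
To make the feature families named in the abstract more concrete, the sketch below computes a few of them (lexical richness, punctuation usage, and one readability index) using only the Python standard library. It is an illustration under stated assumptions, not the study's pipeline: the function names `count_syllables` and `stylometric_features` are hypothetical, the type-token ratio stands in for "lexical richness", the Flesch Reading Ease score stands in for "readability indices", and the syllable counter is a rough vowel-group heuristic.

```python
import re
import string


def count_syllables(word):
    """Rough vowel-group heuristic (an assumption, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def stylometric_features(text):
    """Return a small dict of hand-crafted features for one text.

    Hypothetical helper for illustration; the study's exact feature
    definitions are not given in the abstract.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    # Guard against division by zero on empty input.
    n_sent = max(1, len(sentences))
    n_words = max(1, len(words))

    syllables = sum(count_syllables(w) for w in words)
    punct = sum(1 for ch in text if ch in string.punctuation)
    words_per_sentence = len(words) / n_sent

    return {
        # Lexical richness: distinct words divided by total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
        # A simple proxy for syntactic complexity.
        "avg_sentence_length": words_per_sentence,
        # Punctuation usage: punctuation characters per character.
        "punct_per_char": punct / max(1, len(text)),
        # Flesch Reading Ease: 206.835 - 1.015*(W/S) - 84.6*(syll/W).
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * (syllables / n_words),
    }
```

A feature dictionary of this kind, computed per text, could be turned into a numeric row and passed to any lightweight classifier, such as the Random Forest and Naïve Bayes models the study compares.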
