This study presents an efficient and interpretable approach to distinguishing human-authored text from machine-generated content using traditional machine learning techniques, thereby avoiding the computational demands of transformer-based classifiers. Two datasets were employed to ensure generalizability: (1) ROCStories narratives paired with continuations generated by FALCON-7B under three creativity settings, and (2) short news articles from The Indian Times and The Guardian continued by LLaMA-7B under identical settings. Preprocessing involved Minimal Text Cleaning (MTC) and Advanced Text Normalization (ATN), followed by feature extraction covering TF-IDF, Part-of-Speech distributions, Named Entity Recognition, readability indices, lexical richness, n-gram frequencies, sentiment polarity, punctuation usage, and syntactic complexity. Random Forest consistently achieved the top performance (accuracy up to 0.98, ROC AUC up to 0.99), outperforming the Naïve Bayes baseline. To enhance transparency, SHAP-based explainability was applied, revealing that readability metrics, lexical richness, unigrams, and linguistic structures (POS and NER) were the strongest drivers of classification across both datasets. For comparison, GPT-4o and GPT-3.5-Turbo, tested in zero-shot mode, achieved a maximum accuracy of 0.68. These results highlight the robustness, computational efficiency, and interpretability of feature-engineered models, suggesting their value as lightweight, transparent, and reliable components in decision-support systems where content authenticity verification is critical.
Kian Jazayeri (Fri,) studied this question.
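To make the feature-engineering idea in the abstract more concrete, the following minimal sketch computes a few interpretable per-document features of the kind the abstract lists (lexical richness and punctuation usage, plus simple length statistics). The function name `stylometric_features`, the tokenization rules, and the exact feature definitions are illustrative assumptions; the abstract does not specify how the study computed its features.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Compute a few simple, interpretable features for one document.

    Hypothetical helper: feature definitions are assumptions for
    illustration, not the study's actual implementation.
    """
    # Lowercased word tokens (letters and apostrophes only).
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    n_chars = len(text)
    # Count ASCII punctuation characters.
    n_punct = sum(1 for ch in text if ch in string.punctuation)
    return {
        # Lexical richness as type-token ratio (unique words / total words).
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Punctuation usage as a fraction of all characters.
        "punctuation_rate": n_punct / n_chars if n_chars else 0.0,
    }


sample = "The cat sat. The cat ran!"
features = stylometric_features(sample)
print(features)
```

In a pipeline like the one the abstract describes, a feature dictionary of this kind would presumably be combined with the other feature groups (TF-IDF, POS, NER, and so on) into one vector per document before being passed to a classifier such as Random Forest.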