What question did this study set out to answer?

The study explores how well different machine learning methods classify Hebrew news articles despite limited labeled data.

April 19, 2026 | Open Access

Multi-Task Classification of Hebrew News Articles: A Comparative Study of Classical ML and BERT Models in a Morphologically Rich, Low-Resource Setting

Key Points

Evaluated multi-task classification across four dimensions: domain, sentiment, gender, and source.
Used a feature space of 2149 stylistic and content-based attributes.
Applied Hill-Climbing selection to optimize feature usage.
Contrasted five classical machine learning models with five BERT-based models.
Implemented oversampling strategies to address class imbalance.
Under extreme data scarcity (n = 306), the performance gap between optimized classical ML and BERT-based models was marginal.
Stylistic features enhanced stability and interpretability in classification tasks.
Established a benchmark for Hebrew news classification with a curated dataset.
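
The Hill-Climbing selection mentioned above can be illustrated with a minimal sketch. The summary does not specify which variant the study used, so this assumes a simple greedy version: start from an empty subset, toggle one feature at a time, and move to the best neighbor only if it strictly improves the score. The feature names and the `toy_score` function are hypothetical stand-ins; in practice the score would be something like cross-validated accuracy of a classifier trained on the candidate subset.

```python
from typing import Callable, FrozenSet, Sequence, Tuple


def hill_climb_select(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    max_iters: int = 100,
) -> Tuple[FrozenSet[str], float]:
    """Greedy hill climbing over feature subsets (illustrative, not the paper's code)."""
    current: FrozenSet[str] = frozenset()
    best = score(current)
    for _ in range(max_iters):
        best_neighbor, best_neighbor_score = None, best
        for f in features:
            # Neighbor = current subset with feature f toggled (added or dropped).
            neighbor = current ^ {f}
            s = score(neighbor)
            if s > best_neighbor_score:
                best_neighbor, best_neighbor_score = neighbor, s
        if best_neighbor is None:
            break  # local optimum: no single toggle improves the score
        current, best = best_neighbor, best_neighbor_score
    return current, best


# Hypothetical feature names and gains, for demonstration only.
USEFUL = {
    "avg_sentence_length": 0.30,
    "function_word_ratio": 0.25,
    "tfidf_sport": 0.20,
}
ALL_FEATURES = list(USEFUL) + ["punctuation_count", "word_length_var"]


def toy_score(subset: FrozenSet[str]) -> float:
    # Assumed stand-in for validation accuracy: reward useful features,
    # charge a small penalty per selected feature.
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.05 * len(subset)


selected, best_score = hill_climb_select(ALL_FEATURES, toy_score)
print(sorted(selected), round(best_score, 2))
```

With 2149 candidate features, each iteration of this naive version scores every single-feature toggle, so a real run would likely need a cheaper scoring function or a restricted neighborhood.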

Abstract

The automated classification of Hebrew, a morphologically rich language (MRL), presents unique challenges, particularly when high-quality labeled data are scarce. This study investigates the interplay between handcrafted feature engineering and transformer-based representations in a low-resource news classification setting (n = 306). We evaluate multi-task classification across four dimensions: domain, sentiment, gender, and source. Our methodology employs a feature space of 2149 stylistic and content-based attributes, optimized through a systematic Hill-Climbing selection process. We contrast five classical machine learning models with five BERT-based models, integrating five oversampling strategies to mitigate class imbalance. The results reveal that under extreme data scarcity, the performance gap between deep learning and optimized classical ML becomes marginal, with stylistic features providing critical stability and interpretability. This study contributes a curated Hebrew news dataset and establishes a benchmark, demonstrating that linguistically aware feature engineering remains a vital component of MRL processing when large-scale fine-tuning is impractical.
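
The abstract does not name the five oversampling strategies. As an illustration of the general idea only, the sketch below shows plain random oversampling, one of the simplest approaches: minority classes are padded with randomly repeated examples until every class matches the size of the largest one. The function name, document IDs, and labels are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_oversample(
    samples: Sequence[str],
    labels: Sequence[Hashable],
    seed: int = 0,
) -> Tuple[List[str], List[Hashable]]:
    """Duplicate minority-class samples at random until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(items) for items in by_class.values())

    out_x: List[str] = []
    out_y: List[Hashable] = []
    for y, items in by_class.items():
        # Draw extra copies with replacement from the class's own samples.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Hypothetical toy data: four "news" articles and two "sport" articles.
texts = ["d1", "d2", "d3", "d4", "d5", "d6"]
domains = ["news", "news", "news", "news", "sport", "sport"]

bal_x, bal_y = random_oversample(texts, domains, seed=42)
print(Counter(bal_y))
```

In a cross-validated setup, oversampling is normally applied to the training folds only, so that duplicated examples do not leak into the evaluation data.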
