
April 19, 2026 · Open Access

Optimized Machine Learning Pipeline for Lung Cancer Classification: Feature Reduction and Hyperparameter Tuning

Key points

This research aims to create a fully optimized machine learning pipeline for lung cancer classification using clinical data.
The dataset comprised 309 patient records with demographic, lifestyle, and clinical attributes.
The data were preprocessed and split into training (80%) and testing (20%) sets.
Feature selection used metaheuristic algorithms: Red Deer Optimization, Binary Grasshopper Optimization, Gray Wolf Optimization, and Bee Colony Optimization.
Six ML classifiers were trained with optimized hyperparameters.
Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC–AUC.
Logistic Regression achieved the highest accuracy of 91.07% with an AUC of 0.91.
Gradient Boosting and Random Forest both had an accuracy of 87.5%.
The optimized pipeline significantly improved classification performance, and Logistic Regression, the simplest model, outperformed the more complex ensemble models.
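The evaluation metrics named in the key points can be illustrated with a minimal sketch. The confusion-matrix counts, labels, and scores below are hypothetical values chosen only for illustration; they are not the study's data, and the helper names `classification_metrics` and `roc_auc` are assumptions, not taken from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical counts and scores, for illustration only.
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=8)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics, auc)
```

The pairwise definition of AUC is quadratic in the number of samples, which is acceptable for a dataset of a few hundred records; library implementations use a sorted, threshold-based computation instead.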

Abstract

Background: Lung cancer remains one of the leading causes of cancer-related mortality worldwide, primarily due to late diagnosis. Although machine learning (ML) techniques have been widely applied for lung cancer classification, many studies lack a fully optimized end-to-end pipeline using routine clinical data. This study proposes an optimized ML framework that integrates demographic, lifestyle, and clinical features with systematic hyperparameter tuning to improve classification performance.

Methods: A dataset of 309 patient records containing demographic, lifestyle, and clinical attributes was used. The data were preprocessed and split into training and testing sets in an 80:20 ratio. Feature selection was performed using metaheuristic algorithms, including Red Deer Optimization, Binary Grasshopper Optimization, Gray Wolf Optimization, and Bee Colony Optimization. Six ML classifiers (Logistic Regression, Support Vector Classifier, Gradient Boosting, Random Forest, K-Nearest Neighbors, and Gaussian Naive Bayes) were trained with optimized hyperparameters. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC–AUC.

Results: The optimized pipeline significantly improved classification performance. Logistic Regression achieved the highest accuracy of 91.07% with an AUC of 0.91, outperforming more complex ensemble models. Gradient Boosting and Random Forest both achieved an accuracy of 87.5%, while other classifiers demonstrated moderate performance.

Conclusions: The proposed optimized ML pipeline enhances lung cancer classification accuracy using routine clinical data. The results highlight that simpler, well-optimized models can outperform complex approaches on structured datasets. This framework shows strong potential for early lung cancer risk screening and clinical decision support, although further validation on larger datasets is recommended.
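The split-tune-evaluate workflow described in the Methods can be sketched with scikit-learn. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic (only the row count of 309 mirrors the abstract), the feature count, the grid over `C`, the 5-fold cross-validation, and the stratified split are all choices made here, and the metaheuristic feature-selection step is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 309 rows to mirror the record count,
# 15 features chosen arbitrarily. The study's real dataset is not used here.
X, y = make_classification(
    n_samples=309, n_features=15, n_informative=6, random_state=0
)

# 80:20 split as in the abstract; stratification is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the pipeline so it is fit on training folds only.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical hyperparameter grid; the paper's search space is not given here.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out 20%.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]
test_accuracy = accuracy_score(y_test, y_pred)
test_auc = roc_auc_score(y_test, y_score)
print(search.best_params_, test_accuracy, test_auc)
```

Swapping `LogisticRegression` for another estimator (for example `RandomForestClassifier`) with its own grid would extend the same pattern to the other classifiers listed. The scores printed for synthetic data say nothing about the results reported in the study.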
