What does this research mean for the field?

The Dueling Double Deep Q-Network (D3QN) framework achieves F1-scores of 99.20% on Microsoft Big2015, 98.64% on BODMAS, and 85.07% on EMBER 2018 while using approximately 60 features on average, which corresponds to up to 97.5% dimensionality reduction and a matching reduction in feature-extraction cost.

What question did this study set out to answer?

This research aims to improve malware detection accuracy while reducing computational costs by using adaptive feature selection.

February 21, 2026 | Open Access

Adaptive malware detection using sequential feature selection: A dueling double deep Q-Network framework for intelligent classification

Key Points

Formulated malware classification as a Markov Decision Process with episodic feature acquisition.
Employed Dueling Double Deep Q-Network framework for dynamic feature selection.
Evaluated performance on Microsoft Big2015, BODMAS, and EMBER 2018 datasets with comprehensive ablation studies.
Achieved F1-scores of 99.20% on Microsoft Big2015, 98.64% on BODMAS, and 85.07% on EMBER 2018.
Utilized approximately 60 features on average, resulting in up to 97.5% dimensionality reduction.
Demonstrated 76.08% average recall on unseen EMBER 2024 malware variants, a 27.55% relative improvement over traditional methods.
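
The episodic feature-acquisition formulation in the key points can be illustrated with a minimal sketch. It assumes a state made of the masked feature vector concatenated with the acquisition mask, one action per feature plus one terminating action per class, a fixed per-feature cost, and a +1/-1 classification reward. The class name `FeatureAcquisitionEnv`, the `feature_cost` parameter, and the reward values are hypothetical choices for illustration; the paper's actual state encoding and reward design are not given in this summary.

```python
import numpy as np


class FeatureAcquisitionEnv:
    """Toy episodic feature-acquisition MDP (illustrative, not the authors' code).

    Actions 0 .. n_features-1 reveal one feature of the current sample.
    Actions n_features .. n_features+n_classes-1 end the episode by
    predicting a class.
    """

    def __init__(self, n_features, n_classes, feature_cost=0.01):
        self.n_features = n_features
        self.n_classes = n_classes
        self.feature_cost = feature_cost  # hypothetical per-feature penalty
        self.n_actions = n_features + n_classes

    def reset(self, x, y):
        """Start an episode on sample x with true label y; nothing is revealed yet."""
        self.x = np.asarray(x, dtype=np.float32)
        self.y = int(y)
        self.mask = np.zeros(self.n_features, dtype=np.float32)
        return self._state()

    def _state(self):
        # Unacquired features are zeroed out; the mask tells the agent
        # which entries are real observations.
        return np.concatenate([self.x * self.mask, self.mask])

    def step(self, action):
        """Return (next_state, reward, done) for one agent action."""
        if not 0 <= action < self.n_actions:
            raise ValueError(f"action {action} out of range")
        if action < self.n_features:
            # Acquire a feature and pay its cost (repeat requests pay again).
            self.mask[action] = 1.0
            return self._state(), -self.feature_cost, False
        # Terminate with a classification decision.
        predicted = action - self.n_features
        reward = 1.0 if predicted == self.y else -1.0
        return self._state(), reward, True
```

Under these assumptions, an agent that calls `reset(x, y)`, acquires a handful of features with `step(i)`, and then terminates with `step(n_features + c)` trades a small cumulative cost against the final classification reward, which is the accuracy-versus-cost trade-off the key points describe.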

Abstract

• Formulates malware classification as a Markov Decision Process with episodic feature acquisition, achieving superior performance across diverse datasets: 99.20% F1-score on Microsoft Big2015, 98.64% on BODMAS, and 85.07% on EMBER 2018 using reinforcement learning.
• Demonstrates systematic superiority over traditional approaches through comprehensive ablation studies, where static feature selection methods exhibit severe performance degradation (up to 10.40% F1-score reduction) while D3QN maintains consistent improvements across all evaluation scenarios.
• Validates robust transferability with 76.08% average recall on unseen EMBER 2024 malware variants across six diverse file formats, demonstrating 27.55% relative improvement over traditional methods and effective zero-day threat detection capabilities.
• Introduces quantitative intelligence assessment framework proving strategic learning behavior with 62.5% categorical preference deviation from random baselines, 57.7% feature specialization, and autonomous discovery of domain-aligned cybersecurity patterns without explicit supervision.

Traditional malware detection methods exhibit computational inefficiency due to exhaustive feature extraction requirements, creating accuracy-efficiency trade-offs that limit real-time deployment. We formulate malware classification as a Markov Decision Process with episodic feature acquisition and propose a Dueling Double Deep Q-Network (D3QN) framework for adaptive sequential feature selection. The agent learns to dynamically explore informative features per sample before terminating with classification decisions, optimizing both detection accuracy and computational cost through reinforcement learning.

We evaluate our approach on the Microsoft Big2015 (9-class, 1795 features), BODMAS, and EMBER 2018 (binary, 2381 features) datasets. D3QN achieves 99.20%, 98.64%, and 85.07% F1-scores respectively while utilizing approximately 60 features on average, representing 96.6% and 97.5% dimensionality reduction compared to full feature sets. Comprehensive ablation studies across six feature selection methods demonstrate that traditional approaches suffer severe performance degradation (averaging 1.85-10.40% F1-score reduction) when constrained to comparable feature subsets, while D3QN maintains consistent improvements (+1.38% to +5.08%) across all evaluation scenarios. Cross-dataset transferability validation on EMBER 2024 demonstrates superior zero-day detection capabilities, achieving 76.08% average recall on unseen malware variants across diverse file formats, representing 27.55% relative improvement over traditional methods.

Quantitative intelligence assessment reveals strategic learning behavior with 62.5% categorical preference deviation from random baselines and 57.7% feature specialization. The learned policies exhibit autonomous discovery of domain-aligned patterns, identifying structural anomaly indicators and behavioral signatures characteristic of cybersecurity expertise. Our results validate reinforcement learning-based sequential feature selection for malware classification, achieving superior accuracy with substantial computational reduction through learned adaptive policies that outperform static dimensionality reduction techniques across diverse threat landscapes.
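
For readers unfamiliar with the two ingredients named in "Dueling Double Deep Q-Network", the following sketch shows them in their standard textbook form using NumPy arrays in place of network outputs. It assumes the common mean-subtracted dueling aggregation and the usual Double DQN target, in which the online network selects the next action and the target network evaluates it. Whether the paper uses exactly these variants is not stated in the abstract, and the function names and the `gamma` default are illustrative.

```python
import numpy as np


def dueling_q(value, advantage):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).

    value has shape (batch, 1); advantage has shape (batch, n_actions).
    """
    value = np.asarray(value, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    return value + advantage - advantage.mean(axis=1, keepdims=True)


def double_dqn_target(reward, done, q_next_online, q_next_target, gamma=0.99):
    """Double DQN target: y = r + gamma * Q_target(s', argmax_a Q_online(s', a)).

    The bootstrap term is dropped for terminal transitions (done == 1).
    """
    reward = np.asarray(reward, dtype=np.float64)
    done = np.asarray(done, dtype=np.float64)
    q_next_online = np.asarray(q_next_online, dtype=np.float64)
    q_next_target = np.asarray(q_next_target, dtype=np.float64)
    # Action selection uses the online network's estimates ...
    best = np.argmax(q_next_online, axis=1)
    # ... while action evaluation uses the target network's estimates.
    bootstrap = q_next_target[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * bootstrap
```

In a feature-acquisition setting, the terminal transitions would be the classification actions, so their targets reduce to the classification reward alone, while feature-acquisition steps bootstrap from the next state's value.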

Cite This Study

Khan et al. studied this question.

synapsesocial.com/papers/69994bef873532290d02012d https://doi.org/10.1016/j.jisa.2026.104407
