• Formulates malware classification as a Markov Decision Process with episodic feature acquisition, achieving superior performance across diverse datasets: 99.20% F1-score on Microsoft Big2015, 98.64% on BODMAS, and 85.07% on EMBER 2018 using reinforcement learning.
• Demonstrates systematic superiority over traditional approaches through comprehensive ablation studies, where static feature selection methods exhibit severe performance degradation (up to 10.40% F1-score reduction) while D3QN maintains consistent improvements across all evaluation scenarios.
• Validates robust transferability with 76.08% average recall on unseen EMBER 2024 malware variants across six diverse file formats, demonstrating a 27.55% relative improvement over traditional methods and effective zero-day threat detection capabilities.
• Introduces a quantitative intelligence assessment framework proving strategic learning behavior with 62.5% categorical preference deviation from random baselines, 57.7% feature specialization, and autonomous discovery of domain-aligned cybersecurity patterns without explicit supervision.

Traditional malware detection methods are computationally inefficient because they require exhaustive feature extraction, creating accuracy-efficiency trade-offs that limit real-time deployment. We formulate malware classification as a Markov Decision Process with episodic feature acquisition and propose a Dueling Double Deep Q-Network (D3QN) framework for adaptive sequential feature selection. The agent learns to dynamically acquire informative features for each sample before terminating with a classification decision, optimizing both detection accuracy and computational cost through reinforcement learning. We evaluate our approach on the Microsoft Big2015 (9-class, 1795 features), BODMAS, and EMBER 2018 (binary, 2381 features) datasets.
D3QN achieves F1-scores of 99.20%, 98.64%, and 85.07%, respectively, while using approximately 60 features on average, which corresponds to 96.6% and 97.5% dimensionality reduction relative to the full 1795- and 2381-feature sets. Comprehensive ablation studies across six feature selection methods show that traditional approaches suffer severe performance degradation (average F1-score reductions of 1.85% to 10.40%) when constrained to comparable feature subsets, while D3QN maintains consistent improvements (+1.38% to +5.08%) across all evaluation scenarios. Cross-dataset transferability validation on EMBER 2024 demonstrates superior zero-day detection capabilities, with 76.08% average recall on unseen malware variants across diverse file formats, a 27.55% relative improvement over traditional methods. Quantitative intelligence assessment reveals strategic learning behavior, with 62.5% categorical preference deviation from random baselines and 57.7% feature specialization. The learned policies autonomously discover domain-aligned patterns, identifying structural anomaly indicators and behavioral signatures characteristic of cybersecurity expertise. Our results validate reinforcement learning-based sequential feature selection for malware classification: learned adaptive policies achieve superior accuracy with substantial computational reduction and outperform static dimensionality reduction techniques across diverse threat landscapes.
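
The episodic feature-acquisition formulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `FeatureAcquisitionEnv`, the state layout (masked feature values concatenated with the acquisition mask), and the reward values (a small per-feature cost and a terminal reward of +1 or -1) are assumptions chosen only to make the structure of the MDP concrete.

```python
class FeatureAcquisitionEnv:
    """Toy episodic MDP: acquire features one at a time, then classify.

    Actions 0..n_features-1 acquire a feature; actions
    n_features..n_features+n_classes-1 terminate with a class label.
    All reward values are illustrative assumptions.
    """

    def __init__(self, n_features, n_classes,
                 feature_cost=0.01, correct_reward=1.0, wrong_reward=-1.0):
        self.n_features = n_features
        self.n_classes = n_classes
        self.feature_cost = feature_cost
        self.correct_reward = correct_reward
        self.wrong_reward = wrong_reward
        self._x, self._y, self._mask = [], 0, []

    def reset(self, x, y):
        """Start an episode on one sample with no features acquired."""
        self._x, self._y = list(x), y
        self._mask = [0] * self.n_features
        return self._state()

    def _state(self):
        # Unacquired features are hidden as 0.0; the mask tells the agent
        # which entries are real observations.
        observed = [v if m else 0.0 for v, m in zip(self._x, self._mask)]
        return observed + self._mask

    def valid_actions(self):
        """Features not yet acquired, plus every classification action."""
        acquire = [i for i, m in enumerate(self._mask) if not m]
        classify = [self.n_features + c for c in range(self.n_classes)]
        return acquire + classify

    def step(self, action):
        """Return (next_state, reward, done)."""
        if action < self.n_features:
            if self._mask[action]:
                raise ValueError("feature already acquired")
            self._mask[action] = 1
            return self._state(), -self.feature_cost, False
        predicted = action - self.n_features
        correct = predicted == self._y
        reward = self.correct_reward if correct else self.wrong_reward
        return self._state(), reward, True
```

With four features and two classes, `env.step(3)` reveals feature 3 at a small cost, and `env.step(4 + 1)` ends the episode by predicting class 1. The per-feature cost is what pushes an agent toward short acquisition sequences, such as the roughly 60 features per sample reported above.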
Khan et al. (Thu,) studied this question.
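
For readers unfamiliar with the D3QN name, the two standard ingredients it combines can be sketched in a few lines: the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean(A(s, ·)) and the Double DQN target, in which the online network selects the next action and the target network evaluates it. The function names and the discount factor `gamma=0.99` below are assumptions for illustration; the network architecture and hyperparameters actually used are not given in this abstract.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Subtracting the mean advantage keeps V and A identifiable.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma=0.99):
    """Double DQN bootstrap target for one transition.

    The online network's Q-values pick the greedy next action; the target
    network's Q-values supply its estimate. Terminal transitions (here,
    classification actions) use the reward alone.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

Decoupling action selection from action evaluation is the usual motivation for the Double DQN target, since it reduces the overestimation bias of a plain max over noisy Q-values.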