Abstract Advanced Persistent Threats (APTs) are characterized by persistence and complex attack chains. Information extraction techniques can identify critical knowledge in unstructured Cyber Threat Intelligence (CTI), which improves the detection of APT attacks. At present, high-quality information extraction datasets for Chinese APT scenarios remain scarce, particularly datasets that cover multiple tasks such as entity, relation, and event extraction. This shortage limits the training and performance of detection models. To address this issue, we propose a multi-task information extraction dataset for Chinese APT Cyber Threat Intelligence. The dataset complies with the STIX 2.1 standard and is derived from 116 CTI reports. It covers three tasks: entity, relation, and event extraction. Specifically, it contains 2,574 entities, 1,506 relations, and 139 event instances across 808 sentences. Compared with existing APT threat intelligence datasets, our dataset offers significant advantages in task coverage, annotation granularity, and structural hierarchy. We further validate the dataset using several baseline models. This work supports APT intelligence modeling and cybersecurity research.
Sun et al. (Mon,) studied this question.