What question did this study set out to answer?

This research aims to develop a comprehensive dataset for extracting critical information related to Chinese Advanced Persistent Threats (APTs).

May 26, 2026 | Open Access

A multi-task information extraction dataset for Chinese APT cyber threat intelligence

Key Points

The study builds a multi-task information extraction dataset that complies with the STIX 2.1 standard and is derived from 116 CTI reports.
The dataset covers three tasks (entity, relation, and event extraction), with 2,574 entities, 1,506 relations, and 139 event instances across 808 sentences.
The dataset is validated using several baseline models.
Compared with existing APT threat intelligence datasets, it offers advantages in task coverage, annotation granularity, and structural hierarchy.
The work is intended to support APT intelligence modeling and cybersecurity research.

Abstract

Advanced Persistent Threats (APTs) are characterized by persistence and complex attack chains. Information extraction techniques enable the identification of critical knowledge from unstructured Cyber Threat Intelligence (CTI), improving the detection of APT attacks. At present, high-quality information extraction datasets for Chinese APT scenarios remain scarce, particularly those covering multiple tasks such as entity, relation, and event extraction. This shortage limits the training and performance improvement of detection models. To address this issue, a multi-task information extraction dataset for Chinese APT Cyber Threat Intelligence is proposed. The dataset complies with the STIX 2.1 standard and is derived from 116 CTI reports. It covers three tasks: entity, relation, and event extraction. Specifically, it includes 2,574 entities, 1,506 relations, and 139 event instances across 808 sentences. Compared with existing APT threat intelligence datasets, our dataset offers significant advantages in task coverage, annotation granularity, and structural hierarchy. The dataset is further validated using several baseline models. This work provides strong support for APT intelligence modeling and cybersecurity research.
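
To make the three-task structure concrete, the following is a minimal Python sketch of how one annotated sentence might be represented. The record layout, field names, event type, and argument roles are assumptions for illustration only; the abstract does not publish the dataset's actual schema or label set. The example sentence and the names in it are invented placeholders, not dataset content. The type strings `intrusion-set`, `malware`, and `identity` and the relationship labels `uses` and `targets` are taken from the STIX 2.1 vocabulary; the only figures drawn from the abstract are the reported totals, which are used to compute average annotation density per sentence.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    text: str        # surface form as it appears in the sentence
    start: int       # character offset, inclusive
    end: int         # character offset, exclusive
    stix_type: str   # a STIX 2.1 object type such as "malware"


@dataclass
class Relation:
    head: int        # index of the source entity in the sentence's entity list
    tail: int        # index of the target entity
    label: str       # a STIX 2.1 relationship type such as "uses"


@dataclass
class Event:
    trigger: str                 # word that signals the event
    event_type: str              # hypothetical event label
    arguments: dict[str, int]    # hypothetical role name -> entity index


@dataclass
class AnnotatedSentence:
    text: str
    entities: list[Entity] = field(default_factory=list)
    relations: list[Relation] = field(default_factory=list)
    events: list[Event] = field(default_factory=list)


def make_entity(sentence: str, surface: str, stix_type: str) -> Entity:
    """Locate the first occurrence of `surface` and record its offsets."""
    start = sentence.index(surface)
    return Entity(surface, start, start + len(surface), stix_type)


# Invented placeholder sentence; real records would be Chinese CTI text.
SENTENCE = "ExampleGroup uses ExampleRAT to attack energy companies."

example = AnnotatedSentence(
    text=SENTENCE,
    entities=[
        make_entity(SENTENCE, "ExampleGroup", "intrusion-set"),
        make_entity(SENTENCE, "ExampleRAT", "malware"),
        make_entity(SENTENCE, "energy companies", "identity"),
    ],
    relations=[
        Relation(head=0, tail=1, label="uses"),
        Relation(head=0, tail=2, label="targets"),
    ],
    events=[
        Event(
            trigger="attack",
            event_type="attack",  # hypothetical label
            arguments={"attacker": 0, "instrument": 1, "victim": 2},
        )
    ],
)

# Totals reported in the abstract.
TOTALS = {"entities": 2574, "relations": 1506, "events": 139}
SENTENCES = 808


def per_sentence_density(totals: dict[str, int], n_sentences: int) -> dict[str, float]:
    """Average number of annotations of each kind per sentence."""
    return {name: round(count / n_sentences, 2) for name, count in totals.items()}


if __name__ == "__main__":
    print(per_sentence_density(TOTALS, SENTENCES))
```

Under these reported totals, the averages come to roughly 3.19 entities, 1.86 relations, and 0.17 events per sentence, so event annotations are much sparser than entity and relation annotations.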
