What type of study is this?

This is a quantitative study.

October 16, 2025 · Open Access

AICrypto: A Comprehensive Benchmark for Evaluating Cryptography Capabilities of Large Language Models

Key points

LLMs matched human experts in memorizing cryptographic concepts and exploiting vulnerabilities, yet struggled with abstract reasoning.
AICrypto includes 135 multiple-choice questions, 150 CTF challenges, and 18 proof problems to assess a range of cryptographic skills.
The benchmark is designed with input from cryptography experts to ensure high accuracy and correctness across tasks.
Results indicate that while LLMs show promise, they still need improvement in dynamic analysis and multi-step reasoning.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a variety of domains. However, their applications in cryptography, which serves as a foundational pillar of cybersecurity, remain largely unexplored. To address this gap, we propose AICrypto, the first comprehensive benchmark designed to evaluate the cryptography capabilities of LLMs. The benchmark comprises 135 multiple-choice questions, 150 capture-the-flag (CTF) challenges, and 18 proof problems, covering a broad range of skills from factual memorization to vulnerability exploitation and formal reasoning. All tasks are carefully reviewed or constructed by cryptography experts to ensure correctness and rigor. To support automated evaluation of CTF challenges, we design an agent-based framework. We introduce strong human expert performance baselines for comparison across all task types. Our evaluation of 17 leading LLMs reveals that state-of-the-art models match or even surpass human experts in memorizing cryptographic concepts, exploiting common vulnerabilities, and routine proofs. However, our case studies reveal that they still lack a deep understanding of abstract mathematical concepts and struggle with tasks that require multi-step reasoning and dynamic analysis. We hope this work could provide insights for future research on LLMs in cryptographic applications. Our code and dataset are available at https://aicryptobench.github.io/.

Cite this study

Wang et al. studied this question.

synapsesocial.com/papers/68f0f51d8dd8ea469b1d6fbc — DOI: https://doi.org/10.48550/arxiv.2507.09580

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

CyberMetric: A Benchmark Dataset based on Retrieval-Augmented Generation for Evaluating LLMs in Cybersecurity Knowledge · 2024 · 7 citations
CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models · 2024 · 8 citations
An Empirical Evaluation of LLMs for Solving Offensive Security Challenges · 2024 · 4 citations
SECURE: Benchmarking Large Language Models for Cybersecurity · 2024 · 2 citations
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security · 2024 · 1 citation

Authors

Yu Wang

Qingdao University of Science and Technology

Yijian Liu

Nanjing Normal University

Lingzhao Ji

Lanzhou Jiaotong University