What question did this study set out to answer?

The study aims to propose a two-stage attack framework that targets Korean language models by exploiting the properties of Korean as an agglutinative language.

March 13, 2026

A Novel Two-Stage Attacks on Korean Language Models: Single-Token Triggers Search and Morphology-Preserving Minimal Edits

Key Points

Proposes a two-stage attack framework for Korean language models that exploits the agglutinative structure of Korean.
Stage 1 is a universal adversarial trigger search: using only the model's gradient information, with no intervention in training, it finds a single-token trigger that flips predictions.
Stage 2 is applied only to the samples on which stage 1 failed and makes morphology-preserving minimal edits, replacing at most two tokens that combine particles and endings.
The framework was evaluated on the NSMC dataset with the KoBERT and KoELECTRA models.
The attack success rate was 0.963 for KoBERT and 0.940 for KoELECTRA.
Triggers appended to the end of a sentence were highly effective, because key information in Korean tends to appear at the end of the sentence.
Words that express sentiment indirectly also acted as strong triggers.
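
The control flow described in the key points can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier `toy_predict`, the weight table, the glossed placeholder tokens (such as `fun-DECL`), the trigger `sigh`, and the substitution table are all hypothetical stand-ins. It also assumes that stage 2 edits the original sample rather than the trigger-appended one, which the summary does not specify.

```python
from itertools import combinations, product
from typing import Callable, Dict, List, Optional, Tuple

Predict = Callable[[List[str]], int]

# Hypothetical toy weights; glossed tokens stand in for Korean stem+ending tokens.
WEIGHTS: Dict[str, float] = {
    "fun-DECL": 2.0, "best-DECL": 4.0, "best-CONC": -0.5,
    "moving-DECL": 3.0, "moving-CONC": 0.5, "truly-ADV": 3.0, "sigh": -3.0,
}


def toy_predict(tokens: List[str]) -> int:
    """Stand-in classifier: 1 (positive) if the summed weights exceed 0, else 0."""
    return 1 if sum(WEIGHTS.get(t, 0.0) for t in tokens) > 0 else 0


def stage1(tokens: List[str], label: int, predict: Predict,
           trigger: str) -> Optional[List[str]]:
    """Append one trigger token at the end; return the input if the label flips."""
    attacked = tokens + [trigger]
    return attacked if predict(attacked) != label else None


def stage2(tokens: List[str], label: int, predict: Predict,
           substitutes: Dict[str, List[str]],
           max_edits: int = 2) -> Optional[List[str]]:
    """Try replacing at most `max_edits` tokens, fewest edits first."""
    editable = [i for i, t in enumerate(tokens) if t in substitutes]
    for k in range(1, max_edits + 1):
        for positions in combinations(editable, k):
            options = [substitutes[tokens[i]] for i in positions]
            for choice in product(*options):
                attacked = list(tokens)
                for i, new_token in zip(positions, choice):
                    attacked[i] = new_token
                if predict(attacked) != label:
                    return attacked
    return None


def two_stage_attack(samples: List[Tuple[List[str], int]], predict: Predict,
                     trigger: str,
                     substitutes: Dict[str, List[str]]) -> Dict[str, float]:
    """Run stage 1 on every sample and stage 2 only on stage-1 failures."""
    counts = {"stage1": 0, "stage2": 0, "failed": 0}
    for tokens, label in samples:
        if stage1(tokens, label, predict, trigger) is not None:
            counts["stage1"] += 1
        elif stage2(tokens, label, predict, substitutes) is not None:
            counts["stage2"] += 1
        else:
            counts["failed"] += 1
    succeeded = counts["stage1"] + counts["stage2"]
    return {**counts, "asr": succeeded / len(samples)}


SAMPLES = [
    (["movie-TOP", "fun-DECL"], 1),
    (["acting-NOM", "best-DECL"], 1),
    (["truly-ADV", "best-DECL", "moving-DECL"], 1),
]
SUBSTITUTES = {"best-DECL": ["best-CONC"], "moving-DECL": ["moving-CONC"]}
RESULT = two_stage_attack(SAMPLES, toy_predict, "sigh", SUBSTITUTES)
```

In this toy run the first sample falls to the trigger, the second needs a single token replacement, and the third resists both stages, so the combined success rate is two out of three. Reporting the per-stage counts alongside the overall rate makes it visible how much each stage contributes.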

Abstract

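This study proposes a novel two-stage attack framework applicable to language models for Korean, an agglutinative language. The first stage is a universal adversarial trigger search attack carried out without any intervention in the training process: using only the model's gradient information, it searches for a single-token trigger that can flip the model's predictions and then mounts the attack with it. The second stage, applied only to the samples on which the first stage failed, is an adversarial example attack that follows a morphology-preserving minimal-edit strategy, replacing at most two tokens that combine particles and endings. The effectiveness of the proposed framework was evaluated on the NSMC dataset using the KoBERT and KoELECTRA models. In the experiments, triggers attached to the end of a sentence achieved a high attack success rate, because key information in Korean tends to appear at the end of the sentence. Words that express sentiment indirectly also acted as strong triggers. The attack success rate was 0.963 for the KoBERT model and 0.940 for the KoELECTRA model.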
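
The abstract states only that the stage 1 search uses the model's gradient information. One common way to do this in the universal-trigger literature is a first-order approximation: score each vocabulary token by how much swapping it into the trigger slot is estimated to raise the loss, namely the dot product of the embedding difference with the loss gradient averaged over a batch. The sketch below assumes that scoring rule; the function name `rank_trigger_candidates`, the tiny embedding table, and the gradient values are hypothetical, and the authors' exact search procedure may differ.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_trigger_candidates(embeddings: List[List[float]],
                            avg_grad: Sequence[float],
                            current_id: int,
                            top_k: int = 3) -> List[int]:
    """Rank replacement tokens for the trigger slot by estimated loss increase.

    The estimate for token i is (e_i - e_current) . g, where g is the
    gradient of the loss with respect to the trigger embedding, averaged
    over a batch. Larger values suggest a stronger trigger candidate.
    """
    current = embeddings[current_id]
    scored = []
    for token_id, emb in enumerate(embeddings):
        if token_id == current_id:
            continue
        delta = [e - c for e, c in zip(emb, current)]
        scored.append((dot(delta, avg_grad), token_id))
    # Highest estimated increase first; break ties by token id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [token_id for _, token_id in scored[:top_k]]


# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
AVG_GRAD = [0.5, 2.0]
CANDIDATES = rank_trigger_candidates(EMBEDDINGS, AVG_GRAD, current_id=0, top_k=2)
```

Because the score is only a linear estimate, the top candidates would normally be re-checked with real forward passes, keeping the token that actually flips the most predictions.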
