As large language models (LLMs) become widely used, experts have raised significant concerns about their controllability, safety, and usability. As a result, the vulnerability of alignment and the methods used to evaluate it have become a research focus. Although security alignment techniques such as RLHF attempt to address these problems, they still face challenges such as high labeling costs and incorrect target generalization. This study explores the future trends, benefits, and challenges of LLM security alignment technology. It proposes a security alignment framework for LLMs tailored to individual preferences. The framework employs algorithmic and data countermeasures to enhance the model's generalization performance, lower the cost of manual labeling, and improve the controllability, usability, and safety of LLMs. This approach offers useful implications for future developments in LLM security alignment.
Sun et al. (Fri,) studied this question.