Ensuring that large language models (LLMs) align with human values and goals is crucial for their adoption in high-stakes decision-making. To guard against incorrect, misleading, or otherwise unexpected or undesirable LLM outputs, guardrail engineers implement guardrails based on expert knowledge from subject-matter authorities to steer and align pre-trained LLMs. Existing evaluation methods assess LLM performance with and without guardrails, but provide limited insight into how each individual guardrail, and the interactions among guardrails, contribute to alignment. Here, we present a method to evaluate and select guardrails that best align LLM outputs with empirical evidence representing expert knowledge. Through evaluation with real-world illustrative examples of resume quality and recidivism prediction, we show that our method effectively identifies useful moderation guardrails in a way that could help guardrail engineers interpret the contributions of different guardrails to "user-LLM" alignment.
Anindya Das Antar
Xun Huan
Nikola Banović
University of Michigan
Antar et al. studied this question.
www.synapsesocial.com/papers/68f19f20de32064e504ddbc7 — DOI: https://doi.org/10.1609/aies.v8i1.36583