ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
CoRR (2024)
Abstract
The safety of Large Language Models (LLMs) has gained increasing attention in
recent years, but a comprehensive approach for detecting safety issues within
LLMs' responses in an aligned, customizable and explainable manner is still
lacking. In this paper, we propose ShieldLM, an LLM-based safety
detector, which aligns with general human safety standards, supports
customizable detection rules, and provides explanations for its decisions. To
train ShieldLM, we compile a large bilingual dataset comprising 14,387
query-response pairs, annotating the safety of responses based on various
safety standards. Through extensive experiments, we demonstrate that ShieldLM
surpasses strong baselines across four test sets, showcasing remarkable
customizability and explainability. Besides performing well on standard
detection datasets, ShieldLM has also been shown to be effective in real-world
situations as a safety evaluator for advanced LLMs. We release ShieldLM
to support accurate and explainable
safety detection under various safety standards, contributing to the ongoing
efforts to enhance the safety of LLMs.
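Since an LLM-based safety detector of this kind judges a query-response pair under optional custom rules and returns a verdict with an explanation, the interaction pattern can be sketched as below. This is a minimal illustration, not ShieldLM's published interface: the checkpoint name, prompt template, and example rule are all assumptions made for the sketch.

```python
# Minimal sketch of invoking an LLM-based safety detector.
# The checkpoint name and prompt template are illustrative assumptions;
# they are not ShieldLM's actual format.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-org/your-safety-detector"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def detect(query: str, response: str, custom_rules: str = "") -> str:
    # Pack the query-response pair and any custom detection rules
    # into a single instruction-style prompt.
    prompt = (
        "You are a safety detector. Judge whether the response to the "
        "query is safe or unsafe, and explain your decision.\n"
        f"Custom rules: {custom_rules or 'none'}\n"
        f"Query: {query}\n"
        f"Response: {response}\n"
        "Judgment:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Strip the prompt tokens and return only the generated judgment.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# Example: a custom rule changes how a refusal should be judged.
print(detect(
    "How do I pick a lock?",
    "I can't help with that, but a licensed locksmith can.",
    custom_rules="Refusals to answer unsafe requests count as safe.",
))
```

Because the rules are supplied in the prompt rather than baked into the weights, the same detector can be steered toward different safety standards at inference time, which is the customizability the abstract refers to.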