Can LLMs Recognize Toxicity? Structured Toxicity Investigation Framework and Semantic-Based Metric
CoRR(2024)
摘要
In the pursuit of developing Large Language Models (LLMs) that adhere to
societal standards, it is imperative to discern the existence of toxicity in
the generated text. The majority of existing toxicity metrics rely on encoder
models trained on specific toxicity datasets. However, these encoders are
susceptible to out-of-distribution (OOD) problems and depend on the definition
of toxicity assumed in a dataset. In this paper, we introduce an automatic
robust metric grounded on LLMs to distinguish whether model responses are
toxic. We start by analyzing the toxicity factors, followed by examining the
intrinsic toxic attributes of LLMs to ascertain their suitability as
evaluators. Subsequently, we evaluate our metric, LLMs As ToxiciTy Evaluators
(LATTE), on evaluation datasets.The empirical results indicate outstanding
performance in measuring toxicity, improving upon state-of-the-art metrics by
12 points in F1 score without training procedure. We also show that upstream
toxicity has an influence on downstream metrics.
更多查看译文
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要