The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Ariel Herbert-Voss, Cort Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Lin, Adam Hunt, Justin Tienken-Harder, Kevin Shih, Kemper Talley, John Guan, Ian Steneker, David Campbell, Brad Jokubaitis, Steven Basart, Stephen Fitz, Ponnurangam Kumaraguru, Kallol Karmakar, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin Esvelt, Alexandr Wang, Dan Hendrycks
ICML (2024)