Search Results for author: Ruixuan Huang

Found 2 papers, 1 paper with code

Evaluating Concept-based Explanations of Language Models: A Study on Faithfulness and Readability

1 code implementation • 29 Apr 2024 • Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, Xiting Wang

Based on this, we quantify faithfulness via the difference in the output upon perturbation.

Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector

no code implementations • 18 Apr 2024 • Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie

Even when successful, the harmfulness of their outputs cannot be guaranteed, leading to suspicions that these methods have not accurately identified the safety vulnerabilities of LLMs.
