Search Results for author: Jen-tse Huang

Found 21 papers, 15 papers with code

How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO

no code implementations • 22 Apr 2024 • Man Tik Ng, Hui Tung Tse, Jen-tse Huang, Jingjing Li, Wenxuan Wang, Michael R. Lyu

However, existing studies focus on imitating well-known public figures or fictional characters, overlooking the potential for simulating ordinary individuals.

How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments

1 code implementation • 18 Mar 2024 • Jen-tse Huang, Eric John Li, Man Ho Lam, Tian Liang, Wenxuan Wang, Youliang Yuan, Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Michael R. Lyu

Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs).

Decision Making

A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models

no code implementations • 1 Jan 2024 • Yuxuan Wan, Wenxuan Wang, Yiliu Yang, Youliang Yuan, Jen-tse Huang, Pinjia He, Wenxiang Jiao, Michael R. Lyu

In addition, the test cases of LogicAsker can be further used to design demonstration examples for in-context learning, which effectively improves the logical reasoning ability of LLMs, e.g., 10% for GPT-4.

Code Generation In-Context Learning +2

The Earth is Flat? Unveiling Factual Errors in Large Language Models

no code implementations • 1 Jan 2024 • Wenxuan Wang, Juluan Shi, Zhaopeng Tu, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

Current methods for evaluating LLMs' veracity are limited by test data leakage or the need for extensive human labor, hindering efficient and accurate error detection.

In-Context Learning Multiple-choice

New Job, New Gender? Measuring the Social Bias in Image Generation Models

no code implementations • 1 Jan 2024 • Wenxuan Wang, Haonan Bai, Jen-tse Huang, Yuxuan Wan, Youliang Yuan, Haoyi Qiu, Nanyun Peng, Michael R. Lyu

BiasPainter uses a diverse range of seed images of individuals and prompts the image generation models to edit these images using gender, race, and age-neutral queries.

Fairness Image Generation

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

1 code implementation • 31 Oct 2023 • Tian Liang, Zhiwei He, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Rui Wang, Yujiu Yang, Zhaopeng Tu, Shuming Shi, Xing Wang

Ideally, an advanced agent should possess the ability to accurately describe a given word using an aggressive description while concurrently maximizing confusion in the conservative description, enhancing its participation in the game.

InCharacter: Evaluating Personality Fidelity in Role-Playing Agents through Psychological Interviews

2 code implementations • 27 Oct 2023 • Xintao Wang, Yunze Xiao, Jen-tse Huang, Siyu Yuan, Rui Xu, Haoran Guo, Quan Tu, Yaying Fei, Ziang Leng, Wei Wang, Jiangjie Chen, Cheng Li, Yanghua Xiao

This paper, instead, introduces a novel perspective to evaluate the personality fidelity of RPAs with psychological scales.

Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models

no code implementations • 19 Oct 2023 • Wenxuan Wang, Wenxiang Jiao, Jingyuan Huang, Ruyi Dai, Jen-tse Huang, Zhaopeng Tu, Michael R. Lyu

This paper identifies a cultural dominance issue within large language models (LLMs) due to the predominant use of English data in model training (e.g., ChatGPT).

All Languages Matter: On the Multilingual Safety of Large Language Models

1 code implementation • 2 Oct 2023 • Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, Michael R. Lyu

In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice.

Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench

1 code implementation • 2 Oct 2023 • Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Large Language Models (LLMs) have recently showcased their remarkable capacities, not only in natural language processing tasks but also across diverse domains such as clinical medicine, legal consultation, and education.

Benchmarking

An Image is Worth a Thousand Toxic Words: A Metamorphic Testing Framework for Content Moderation Software

no code implementations • 18 Aug 2023 • Wenxuan Wang, Jingyuan Huang, Jen-tse Huang, Chang Chen, Jiazhen Gu, Pinjia He, Michael R. Lyu

Moreover, through retraining the models with the test cases generated by OASIS, the robustness of the moderation model can be improved without performance degradation.

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher

1 code implementation • 12 Aug 2023 • Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu

We propose a novel framework CipherChat to systematically examine the generalizability of safety alignment to non-natural languages -- ciphers.

Ethics

Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench

1 code implementation • 7 Aug 2023 • Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu

Evaluating Large Language Models' (LLMs) anthropomorphic capabilities has become increasingly important in contemporary discourse.

Revisiting the Reliability of Psychological Scales on Large Language Models

1 code implementation • 31 May 2023 • Jen-tse Huang, Wenxuan Wang, Man Ho Lam, Eric John Li, Wenxiang Jiao, Michael R. Lyu

Recent research has extended beyond assessing the performance of Large Language Models (LLMs) to examining their characteristics from a psychological standpoint, acknowledging the necessity of understanding their behavioral characteristics.

ParroT: Translating during Chat using Large Language Models tuned with Human Translation and Feedback

1 code implementation • 5 Apr 2023 • Wenxiang Jiao, Jen-tse Huang, Wenxuan Wang, Zhiwei He, Tian Liang, Xing Wang, Shuming Shi, Zhaopeng Tu

Therefore, we propose ParroT, a framework to enhance and regulate the translation abilities during chat based on open-source LLMs (e.g., LLaMA), human-written translation and feedback data.

Instruction Following Machine Translation +1

Improving the Transferability of Adversarial Samples by Path-Augmented Method

1 code implementation • CVPR 2023 • Jianping Zhang, Jen-tse Huang, Wenxuan Wang, Yichen Li, Weibin Wu, Xiaosen Wang, Yuxin Su, Michael R. Lyu

However, such methods selected the image augmentation path heuristically and may augment images that are semantics-inconsistent with the target images, which harms the transferability of the generated adversarial samples.

Image Augmentation

MTTM: Metamorphic Testing for Textual Content Moderation Software

1 code implementation • 11 Feb 2023 • Wenxuan Wang, Jen-tse Huang, Weibin Wu, Jianping Zhang, Yizhan Huang, Shuqing Li, Pinjia He, Michael Lyu

In addition, we leverage the test cases generated by MTTM to retrain the model we explored, which largely improves model robustness (0% to 5.9% EFR) while maintaining the accuracy on the original test set.

Sentence

Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine

1 code implementation • 20 Jan 2023 • Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Xing Wang, Shuming Shi, Zhaopeng Tu

By evaluating on a number of benchmark test sets, we find that ChatGPT performs competitively with commercial translation products (e.g., Google Translate) on high-resource European languages but lags behind significantly on low-resource or distant languages.

Machine Translation Sentence +1

Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages

1 code implementation • 18 Oct 2022 • Wenxiang Jiao, Zhaopeng Tu, Jiarui Li, Wenxuan Wang, Jen-tse Huang, Shuming Shi

This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.

Data Augmentation Machine Translation +1

AEON: A Method for Automatic Evaluation of NLP Test Cases

1 code implementation • 13 May 2022 • Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael R. Lyu

However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., grammar errors), which leads to a high false alarm rate and unnatural test cases.

Improving Adversarial Transferability via Neuron Attribution-Based Attacks

2 code implementations • CVPR 2022 • Jianping Zhang, Weibin Wu, Jen-tse Huang, Yizhan Huang, Wenxuan Wang, Yuxin Su, Michael R. Lyu

Deep neural networks (DNNs) are known to be vulnerable to adversarial examples.

Attribute
