Efficient Black-Box Adversarial Attacks on Neural Text Detectors
Neural text detectors are models trained to determine whether a given text was generated by a language model or written by a human. In this paper, we investigate three simple and resource-efficient strategies (parameter tweaking, prompt engineering, and character-level mutations) for altering texts generated by GPT-3.5 such that the changes are unsuspicious or unnoticeable to human readers but cause misclassification by neural text detectors. The results show that parameter tweaking and character-level mutations in particular are effective strategies.
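
For illustration, here is a minimal sketch of a character-level mutation in the spirit of the third strategy, assuming a homoglyph-substitution approach; the homoglyph mapping, mutation rate, and mutate helper below are hypothetical assumptions for this sketch, not the authors' exact method.

    import random

    # Latin characters mapped to visually near-identical Cyrillic homoglyphs
    # (an illustrative assumption, not the paper's specific mapping).
    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e",
                  "p": "\u0440", "c": "\u0441", "x": "\u0445"}

    def mutate(text: str, rate: float = 0.05, seed: int = 0) -> str:
        """Replace a small fraction of characters with homoglyphs.

        The edit is (near-)invisible to a human reader but changes the
        character and token sequence a detector sees, which can be enough
        to flip its human-vs-machine prediction.
        """
        rng = random.Random(seed)
        out = []
        for ch in text:
            if ch in HOMOGLYPHS and rng.random() < rate:
                out.append(HOMOGLYPHS[ch])
            else:
                out.append(ch)
        return "".join(out)

    print(mutate("The quick brown fox jumps over the lazy dog."))

The mutated string renders almost identically on screen, which is why such attacks are hard for humans to notice while remaining effective against detectors that operate on the underlying character or token sequence.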
Methods
Adam, Attention Dropout, BPE, Cosine Annealing, Dense Connections, Dropout, Fixed Factorized Attention, GELU, GPT-3, Layer Normalization, Linear Layer, Linear Warmup With Cosine Annealing, Multi-Head Attention, Residual Connection, Scaled Dot-Product Attention, Softmax, Strided Attention, Weight Decay