LLM Watermark Evasion via Bias Inversion (BIRA)

Overview

Watermarking is a promising way to detect LLM-generated text, but its robustness under realistic query-free evasion remains unclear, and prior query-free attacks suffer from either limited evasion or semantic distortion.

We theoretically analyze rewriting attacks and propose BIRA (Bias-Inversion Rewriting Attack), a practical query-free method that applies a negative logit bias to a proxy suppression set identified by token surprisal.
Empirically, BIRA achieves >99% evasion across diverse watermarking schemes while preserving semantics better than prior baselines.

Theoretical Analysis

A small reduction makes detection decay exponentially

Watermark detection relies on whether the empirical green-token rate of the document exceeds a threshold $p_\tau$. So how far must a rewriter lower that rate to slip under the threshold?

Key result

If there exists $\delta>0$ such that

$$\frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}\!\left[\mathbf{1}\{\tilde{y}^{(n)}\in \mathcal{G}(\mathcal{W}_k)\}\,\middle|\,\tilde{y}^{0:n-1}\right]\;\le\; p_\tau-\delta,$$

then

$$\Pr\!\big[\mathcal{D}(\tilde{y},\mathcal{W}_k)=1\big]\;\le\;\exp\!\left(-\frac{1}{2}\,N\,\delta^2\right),$$

where $\mathcal{G}(\mathcal{W}_k)$ is the green-token set and $\mathcal{D}\!\in\!\{0,1\}$ the detector.

So even a small per-step suppression of the green-token probability, when achieved on average across the sequence, drives the detection probability toward zero: the bound decays exponentially in $N\delta^2$.
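As a worked illustration (the values of $N$ and $\varepsilon$ below are illustrative choices, not settings from the experiments), the bound can be inverted to ask how large a margin $\delta$ suffices to keep the detection probability below a target $\varepsilon$:

$$\exp\!\left(-\tfrac{1}{2}N\delta^2\right)\le\varepsilon \;\iff\; \delta\;\ge\;\sqrt{\frac{2\ln(1/\varepsilon)}{N}}.$$

For $\varepsilon = 0.01$, a 200-token text needs $\delta \ge \sqrt{2\ln(100)/200} \approx 0.215$, while an 800-token text needs only $\delta \approx 0.107$. Quadrupling the length halves the required margin.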

Method

Bias-Inversion Rewriting Attack

The true green list is inaccessible, but the theory says we do not need it exactly: a small, consistent reduction in green-token sampling suffices. BIRA puts this into practice by applying a negative logit bias during rewriting to tokens that are likely to carry watermark traces.

1. Construct a proxy suppression set $\widehat{\mathcal{G}}$

Watermark traces are typically concentrated on high-entropy positions. To suppress these traces, we construct a proxy suppression set $\widehat{\mathcal{G}}$ from the token surprisal $I^{(n)}$, keeping tokens whose surprisal is at or above the $q$-th percentile threshold $\eta$:

$$\widehat{\mathcal{G}} = \big\{\,\mathrm{id}(\hat{y}^{(n)}) \;\big|\; I^{(n)} \ge \eta \,\big\}, \quad \eta = q\text{-th percentile of}\ \{I^{(n)}\}_{n=0}^{N-1},$$

where $I^{(n)} = -\log P_{\mathcal{M}}\!\big(\hat{y}^{(n)} \mid \hat{y}^{(0:n-1)}\big)$ and $q$ controls the size of the suppression set.
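The selection rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the input format (token ids plus precomputed per-token log-probabilities from some scoring model), and the default `q` are all assumptions made for the example.

```python
import numpy as np

def proxy_suppression_set(token_ids, token_logprobs, q=50.0):
    """Return (suppression_set, eta) for one text.

    token_ids      -- ids of the tokens in the text to be rewritten (hypothetical input)
    token_logprobs -- log P_M(y_n | y_<n) for each token, assumed precomputed
    q              -- percentile in [0, 100]; a larger q gives a smaller set
    """
    surprisal = -np.asarray(token_logprobs, dtype=float)  # I^(n) = -log P
    eta = float(np.percentile(surprisal, q))              # q-th percentile threshold
    keep = surprisal >= eta                               # high-surprisal positions
    ids = {int(t) for t, k in zip(token_ids, keep) if k}  # store ids, not positions
    return ids, eta

# Toy usage with made-up log-probabilities.
ids, eta = proxy_suppression_set(
    [11, 42, 7, 42, 99, 3],
    [-0.1, -3.0, -0.5, -2.5, -4.0, -0.2],
    q=50.0,
)
print(sorted(ids), eta)
```

In the toy call, the three highest-surprisal positions hold ids 42, 42 and 99, so the set collapses to two vocabulary entries. The set lives over token ids, which is what lets it be reused at every decoding step.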

2. Invert the bias while rewriting

Add a negative logit bias $\beta$ to tokens in $\widehat{\mathcal{G}}$ at every decoding step:

$$l^{(n)}_u \;\leftarrow\; l^{(n)}_u + \beta\,\mathbf{1}\{u\in\widehat{\mathcal{G}}\}, \quad \forall u\in\mathcal{V},\;\; \beta<0.$$
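A minimal sketch of this update on a single step's logit vector follows. The function names and the default `beta = -4.0` are hypothetical choices for illustration. In a real decoding loop the same operation would be applied to the rewriter's logits before sampling at every step.

```python
import numpy as np

def apply_inverted_bias(logits, suppression_ids, beta=-4.0):
    """Return a copy of `logits` with `beta` added to every id in the suppression set."""
    if beta >= 0:
        raise ValueError("beta must be negative")
    biased = np.array(logits, dtype=float)         # copy; the input is left untouched
    idx = np.fromiter(suppression_ids, dtype=int)  # ids in the proxy suppression set
    biased[idx] += beta
    return biased

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Toy vocabulary of four tokens; token 0 is in the suppression set.
logits = np.array([2.0, 1.0, 0.0, -1.0])
biased = apply_inverted_bias(logits, {0}, beta=-4.0)
print(softmax(logits)[0], softmax(biased)[0])  # probability of token 0 before and after
```

Because the bias is finite, suppressed tokens stay available when the context strongly demands them. They only become less likely, which fits the theoretical requirement of a modest average reduction rather than a hard ban.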

3. Adapt the bias to avoid degeneration

A strong negative bias $\beta$ can occasionally cause text degeneration. We mitigate this by weakening $\beta$ whenever the distinct-1-gram ratio of the output falls below a threshold $\rho$, which signals degenerate text:

$$\beta \;\leftarrow\; \min(0,\; \beta + \mathrm{lr}), \quad \mathrm{lr} > 0.$$
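The update rule can be illustrated as follows. The retry loop, the `generate` callable, and `max_tries` are assumptions made for this sketch: the text above specifies only the degeneration test and the update of $\beta$, not how the rewrite is re-run.

```python
def distinct_1(tokens):
    """Distinct-1-gram ratio: unique tokens / total tokens (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def relax_bias(beta, lr):
    """One update beta <- min(0, beta + lr), with lr > 0."""
    return min(0.0, beta + lr)

def rewrite_with_adaptive_bias(generate, beta, rho, lr, max_tries=10):
    """Re-run a rewrite with a progressively weaker bias until it is not degenerate.

    `generate(beta)` is a hypothetical callable returning the rewritten token list;
    `max_tries` (assumed >= 1) caps the number of attempts.
    """
    for _ in range(max_tries):
        tokens = generate(beta)
        if distinct_1(tokens) >= rho:  # diverse enough: accept this rewrite
            break
        beta = relax_bias(beta, lr)    # degenerate: weaken the bias and retry
    return tokens, beta

# Toy stand-in for a rewriter that degenerates when the bias is too strong.
fake_generate = lambda b: ["a"] * 10 if b < -2.0 else list("abcdefghij")
tokens, final_beta = rewrite_with_adaptive_bias(fake_generate, beta=-4.0, rho=0.5, lr=1.0)
print(final_beta, distinct_1(tokens))
```

The `min(0, ·)` clamp keeps $\beta$ non-positive, so in the worst case the procedure falls back to unbiased rewriting and never starts promoting the suppressed tokens.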

Remark. BIRA replaces SIRA's mask-and-refill step with a controllable negative decoding bias, yielding stronger, more semantics-preserving evasion.

Results

Over 99% evasion across all seven watermarks

BIRA achieves the highest attack success rate (ASR, the fraction of watermarked texts that evade detection) on all seven schemes. The largest gains come against SIR, where ASR rises from 57.6% with SIRA to 99.4% with BIRA (both using GPT-4o-mini).

Attack	KGW	Unigram	UPV	EWD	DIP	SIR	EXP
Vanilla (Llama-3.1-8B)	88.8	73.4	73.4	92.6	99.8	54.0	80.6
Vanilla (Llama-3.1-70B)	87.4	67.0	65.0	89.4	98.8	42.8	70.4
Vanilla (GPT-4o-mini)	60.2	30.2	46.8	58.8	95.8	23.6	31.8
DIPPER-1	93.8	61.2	80.6	92.8	99.4	55.6	90.8
DIPPER-2	97.2	71.8	85.4	96.6	99.2	70.4	97.2
SIRA (Llama-3.1-8B)	98.8	95.0	87.6	99.8	99.6	72.8	95.2
SIRA (Llama-3.1-70B)	98.0	87.6	85.0	99.2	99.6	60.6	88.6
SIRA (GPT-4o-mini)	98.0	85.2	84.8	97.2	99.6	57.6	94.8
BIRA (Llama-3.1-8B, ours)	99.8	99.4	99.8	100.0	100.0	99.6	99.8
BIRA (Llama-3.1-70B, ours)	99.4	99.0	99.6	99.8	99.6	98.8	98.0
BIRA (GPT-4o-mini, ours)	99.4	100.0	100.0	99.8	99.8	99.4	98.2

Table 1. Attack success rate (%) of each attack against seven watermarking schemes. Rows marked "ours" are BIRA. Gains are largest against SIR, the strongest baseline watermark.

Detection & Text Quality

Lowest detectability while highly preserving semantics

Figure 2. Detector performance with calibrated thresholds: best F1 score and TPR at fixed FPR (1% and 10%). BIRA (red) gives the lowest values across all schemes, meaning its rewrites are the hardest to tell from human text. Lower is better.

Figure 3. Text quality measured by LLM judgment, Self-BLEU, and perplexity. BIRA scores higher on LLM-judged fidelity than DIPPER and SIRA, while its low Self-BLEU indicates more aggressive paraphrasing: strong evasion with semantics preserved.

Qualitative Example

BIRA effectively removes the green tokens


Watermarked Text by KGW (z-score: 10.33)

Was it a surprise to you that you were given the arts and culture position? No, there is no surprise when you are a cadre. And it was a great honor. I enjoy serving my country in that capacity. It's what I really love. If anything, it was good to hear you say no once more; given that it was the first official rejection in your official career. Did anything take priority for you that the position didn't? Yes. One of those in-laws I wanted to stay with a little too much priority. But I'm glad to know you asked. I certainly wasn't giving up on helping my country in any capacity, that's not me at all. I just didn't want to go at it one more time. Once you go you will know how to say no. I had already been told in 2008 to hold off on being an officer until I got my undergraduate degree, that was one such instance. There's your answer. Ha, that is great. And good luck, I'm sure you'll do something great.

Attacked Text by BIRA (z-score: 2.60)

Was receiving the arts and culture position unexpected for you? Not at all, since I am part of the cadres. It was actually a huge honor. I truly love serving my country in this role. Given that this was your first formal rejection, I was glad to see you decline. Was there something specific that took precedence over this opportunity? One thing did - my desire to spend time with my in-laws took higher priority. I appreciate you asking. Please know I am still committed to contributing to my country's efforts. I simply chose not to pursue this particular path. Saying "no" becomes easier with time. For example, I was advised back in 2008 to delay becoming an officer until I finished my undergrad studies. That's excellent. I wish you all the best and am confident you'll achieve great things.

Figure 4. KGW-watermarked text and its BIRA rewrite (Llama-3.1-8B). BIRA suppresses green tokens while preserving meaning, reducing the $z$-score from 10.33 to 2.60.
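For readers unfamiliar with the $z$-scores quoted in Figure 4, the sketch below shows the one-proportion $z$-test commonly used by KGW-style detectors. It is an assumption that the detector here follows this standard form; the green-list fraction `gamma`, the token counts, and the threshold are illustrative values, not the configuration behind the figure.

```python
import math

def green_z_score(num_green, num_tokens, gamma):
    """z-score of the green-token count against the no-watermark rate `gamma`."""
    return (num_green - gamma * num_tokens) / math.sqrt(num_tokens * gamma * (1.0 - gamma))

def rate_threshold(z_threshold, num_tokens, gamma):
    """Green-token rate at which the z-score equals `z_threshold`."""
    return gamma + z_threshold * math.sqrt(gamma * (1.0 - gamma) / num_tokens)

# Illustrative numbers only: 150 green tokens out of 200 with gamma = 0.5.
print(green_z_score(150, 200, 0.5))
# Rate a rewriter must stay below to avoid a z-threshold of 4 at this length.
print(rate_threshold(4.0, 200, 0.5))
```

Under this reading, thresholding the $z$-score is equivalent to thresholding the empirical green-token rate, which is the role $p_\tau$ plays in the theoretical analysis.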

Analysis

Proxy suppression set $\widehat{\mathcal{G}}$ effectively targets watermark traces

Figure 5. Per-sample detection upper bounds for BIRA vs. Vanilla on Unigram. For each sample, compute the true green-token probability $\bar{p}$, set $\hat{\delta} = \max(0, p_\tau - \bar{p})$, and evaluate the bound $\exp(-\tfrac{1}{2}N\hat{\delta}^2)$. BIRA pushes the bound far below Vanilla (90th percentile: $7.50\times10^{-2}$ vs. $7.97\times10^{-1}$).
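The per-sample computation described in the caption can be sketched as below. The function name and the input format (a list of per-step green-token probabilities, which requires access to the true green list and so is only available for analysis) are assumptions, and the numbers in the usage lines are illustrative.

```python
import math

def per_sample_bound(green_probs, p_tau):
    """Evaluate exp(-0.5 * N * delta_hat^2), delta_hat = max(0, p_tau - mean(green_probs))."""
    n = len(green_probs)
    p_bar = sum(green_probs) / n         # average green-token probability of the sample
    delta_hat = max(0.0, p_tau - p_bar)  # margin below the threshold (0 if none)
    return math.exp(-0.5 * n * delta_hat ** 2)

# A 200-token sample whose average green probability sits 0.15 below the threshold.
print(per_sample_bound([0.45] * 200, p_tau=0.6))
# A sample at or above the threshold gets the trivial bound of 1.
print(per_sample_bound([0.70] * 50, p_tau=0.6))
```

A bound of 1 carries no information, so only samples whose average green probability falls below $p_\tau$ contribute to the gap between BIRA and Vanilla.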

Figure 6. ASR for self-information-guided vs. random token selection. Targeting high-surprisal tokens beats random selection at every ratio, indicating that self-information effectively targets watermark traces.

Ablation

$\beta$ and $q$ control the evasion–fidelity trade-off

Discussion & Limitations

Current evaluations may overestimate watermark robustness under realistic query-free rewriting — even for recent sentence-level watermarks.
BIRA currently applies uniform global suppression; context-adaptive or position-wise biasing may preserve semantics even better.

Impact Statement

Red-teaming toward more robust watermarks

This work stress-tests the robustness of current LLM watermarks, revealing that they are not yet sufficiently hardened against sophisticated black-box evasion. While a successful evasion method carries misuse risk, we believe transparently exposing such vulnerabilities is a prerequisite for progress — serving as adversarial red-teaming that motivates more robust defenses and truly responsible AI.

BibTeX

@inproceedings{hwang2026bira,
  title     = {{LLM} Watermark Evasion via Bias Inversion},
  author    = {Hwang, Jeongyeon and Park, Sangdon and Ok, Jungseul},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}