LLM Watermark Evasion via Bias Inversion

Pohang University of Science and Technology (POSTECH), South Korea
ICML 2026
TL;DR

BIRA (Bias-Inversion Rewriting Attack) is a simple, effective black-box watermark evasion method, motivated by a theoretical analysis of rewriting attacks.

  • Reducing the sampling probability of green tokens by a small margin during rewriting leads to an exponential drop in detection probability.
  • BIRA adds a negative bias to high-surprisal tokens (likely watermark traces) while rewriting, with an adaptive bias to avoid distortion.
  • It reaches state-of-the-art evasion (>99%) across diverse watermarks while preserving semantics.

Theoretical Analysis

A small reduction makes detection decay exponentially

Watermark detection thresholds the empirical green-token rate of the document at some value $p_\tau$. So how far must a rewriter lower that rate to slip under the threshold?

Key result

If there exists $\delta>0$ such that

$$\frac{1}{N}\sum_{n=0}^{N-1}\mathbb{E}\!\left[\mathbf{1}\{\tilde{y}^{(n)}\in \mathcal{G}(\mathcal{W}_k)\}\,\middle|\,\tilde{y}^{(0:n-1)}\right]\;\le\; p_\tau-\delta,$$

then

$$\Pr\!\big[\mathcal{D}(\tilde{y},\mathcal{W}_k)=1\big]\;\le\;\exp\!\left(-\frac{1}{2}\,N\,\delta^2\right),$$

where $N$ is the number of tokens in the rewritten text $\tilde{y}$, $\mathcal{G}(\mathcal{W}_k)$ is the green-token set, and $\mathcal{D}\!\in\!\{0,1\}$ is the detector's decision.

So even a small per-step suppression of the green-token probability, when achieved on average across the sequence, drives the detection probability toward zero: the bound decays exponentially in $N\delta^2$.
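To get a feel for how quickly the bound shrinks, the short sketch below evaluates $\exp(-\tfrac{1}{2}N\delta^2)$ for a few values. The function name and the specific values of $N$ and $\delta$ are illustrative assumptions, not settings taken from the paper.

```python
import math


def detection_upper_bound(n_tokens: int, delta: float) -> float:
    """Evaluate the bound exp(-N * delta^2 / 2) from the key result."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    if delta <= 0:
        raise ValueError("the bound requires delta > 0")
    return math.exp(-0.5 * n_tokens * delta ** 2)


# Hypothetical text lengths and margins, chosen only for illustration.
for n_tokens in (100, 200, 400):
    for delta in (0.05, 0.1, 0.2):
        bound = detection_upper_bound(n_tokens, delta)
        print(f"N={n_tokens:4d}  delta={delta:.2f}  bound={bound:.4f}")
```

For example, at $N=200$ a margin of $\delta=0.1$ gives $e^{-1}\approx 0.37$, while doubling the margin to $\delta=0.2$ gives $e^{-4}\approx 0.018$.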

Method

Bias-Inversion Rewriting Attack

Overview of the Bias-Inversion Rewriting Attack (BIRA).
Figure 1. A watermarked LLM raises the sampling probability of green tokens via a positive logit bias ($\gamma > 0$). BIRA inverts this — applying a negative bias ($\beta < 0$) to a proxy suppression set to suppress those tokens during rewriting.

The true green list is inaccessible, but the theory says we do not need it exactly: a small, consistent reduction in green-token sampling suffices. To put this into practice, BIRA targets likely watermark traces by applying a negative bias during rewriting.

1. Identify a proxy suppression set $\widehat{\mathcal{G}}$

Watermark traces are typically concentrated on high-entropy positions. To suppress these traces, we construct a proxy suppression set $\widehat{\mathcal{G}}$ via token surprisal $I^{(n)}$:

$$\widehat{\mathcal{G}} \leftarrow \big\{\,\mathrm{id}(\hat{y}^{(n)}) \;\big|\; I^{(n)} \ge \eta \,\big\}, \quad \text{where}\ \ I^{(n)} = -\log P_{\mathcal{M}}\!\big(\hat{y}^{(n)} \mid \hat{y}^{(0:n-1)}\big).$$
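The set construction above can be sketched in a few lines. This is a minimal illustration that assumes the per-token probabilities $P_{\mathcal{M}}(\hat{y}^{(n)} \mid \hat{y}^{(0:n-1)})$ have already been obtained from some language model; the token ids, probabilities, and threshold below are made up for the example.

```python
import math
from typing import List, Set


def surprisal(token_probs: List[float]) -> List[float]:
    """Self-information I = -log p for each token probability."""
    return [-math.log(p) for p in token_probs]


def proxy_suppression_set(token_ids: List[int],
                          token_probs: List[float],
                          eta: float) -> Set[int]:
    """Collect the ids of tokens whose surprisal is at least eta."""
    scores = surprisal(token_probs)
    return {tid for tid, s in zip(token_ids, scores) if s >= eta}


# Hypothetical token ids and model probabilities for a six-token text.
token_ids = [464, 3290, 11687, 625, 262, 38394]
token_probs = [0.60, 0.02, 0.35, 0.01, 0.80, 0.10]
eta = 3.0  # assumed threshold, in nats

suppress = proxy_suppression_set(token_ids, token_probs, eta)
print(sorted(suppress))
```

Only the two tokens with probability 0.02 and 0.01 have surprisal above 3 nats here, so they are the ones placed in the set.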

2. Invert the bias while rewriting

Add a negative logit bias $\beta$ to tokens in $\widehat{\mathcal{G}}$ at every decoding step:

$$l^{(n)}_u \;\leftarrow\; l^{(n)}_u + \beta\,\mathbf{1}\{u\in\widehat{\mathcal{G}}\}, \quad \forall u\in\mathcal{V},\;\; \beta<0.$$
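The logit update is easy to illustrate on a toy vocabulary. The sketch below uses plain dictionaries rather than any particular decoding library, and the logits, set contents, and value of $\beta$ are assumptions chosen for the example.

```python
import math
from typing import Dict, Set


def apply_negative_bias(logits: Dict[int, float],
                        suppress: Set[int],
                        beta: float) -> Dict[int, float]:
    """Add beta (<= 0) to the logit of every token in the suppression set."""
    if beta > 0:
        raise ValueError("beta must be non-positive")
    return {u: (l + beta) if u in suppress else l for u, l in logits.items()}


def softmax(logits: Dict[int, float]) -> Dict[int, float]:
    """Numerically stable softmax over a small vocabulary."""
    m = max(logits.values())
    exps = {u: math.exp(l - m) for u, l in logits.items()}
    total = sum(exps.values())
    return {u: e / total for u, e in exps.items()}


# Toy three-token vocabulary; token 1 is in the proxy suppression set.
logits = {0: 2.0, 1: 2.0, 2: 0.0}
suppress = {1}
beta = -4.0  # assumed bias strength

before = softmax(logits)
after = softmax(apply_negative_bias(logits, suppress, beta))
print(f"P(token 1) before: {before[1]:.3f}, after: {after[1]:.3f}")
```

The suppressed token loses most of its probability mass, which is redistributed to the remaining tokens in proportion to their original probabilities.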

3. Adapt the bias to avoid degeneration

A strong negative bias $\beta$ can occasionally cause text degeneration. We mitigate this by weakening $\beta$ whenever the distinct-1-gram ratio of the output falls below a threshold $\rho$, which signals degenerate text:

$$\beta \;\leftarrow\; \min(0,\; \beta + \mathrm{lr}), \quad \mathrm{lr} > 0.$$
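One way to read this update is as a retry loop: generate, check the distinct-1-gram ratio, and relax $\beta$ toward zero if the check fails. The sketch below follows that reading. The regenerate-from-scratch loop, the `generate` callable, and `max_tries` are assumptions made for illustration; the source specifies only the degeneration test and the update rule.

```python
from typing import Callable, List, Tuple


def distinct_1gram_ratio(tokens: List[str]) -> float:
    """Fraction of unique tokens; an empty text is treated as degenerate (0.0)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def adapt_bias(beta: float, lr: float) -> float:
    """Update rule from the text: beta <- min(0, beta + lr), with lr > 0."""
    return min(0.0, beta + lr)


def rewrite_with_adaptive_bias(generate: Callable[[float], List[str]],
                               beta: float, lr: float, rho: float,
                               max_tries: int = 10) -> Tuple[List[str], float]:
    """Regenerate with a weaker bias until the output is not degenerate.

    `generate` is a hypothetical callable mapping a bias to output tokens.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    tokens: List[str] = []
    for _ in range(max_tries):
        tokens = generate(beta)
        if distinct_1gram_ratio(tokens) >= rho:
            break
        beta = adapt_bias(beta, lr)
    return tokens, beta
```

Because the update is clipped at zero, repeated adaptation falls back to unbiased rewriting at worst and never turns the bias positive.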

Results

Over 99% evasion across all seven watermarks

BIRA achieves the highest attack success rate (ASR — the fraction of watermarked texts that evade detection) on all seven schemes, with the largest gains against SIR (57.6% → 99.4% on GPT-4o-mini).

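| Attack | KGW | Unigram | UPV | EWD | DIP | SIR | EXP |
|---|---|---|---|---|---|---|---|
| Vanilla (Llama-3.1-8B) | 88.8 | 73.4 | 73.4 | 92.6 | 99.8 | 54.0 | 80.6 |
| Vanilla (Llama-3.1-70B) | 87.4 | 67.0 | 65.0 | 89.4 | 98.8 | 42.8 | 70.4 |
| Vanilla (GPT-4o-mini) | 60.2 | 30.2 | 46.8 | 58.8 | 95.8 | 23.6 | 31.8 |
| DIPPER-1 | 93.8 | 61.2 | 80.6 | 92.8 | 99.4 | 55.6 | 90.8 |
| DIPPER-2 | 97.2 | 71.8 | 85.4 | 96.6 | 99.2 | 70.4 | 97.2 |
| SIRA (Llama-3.1-8B) | 98.8 | 95.0 | 87.6 | 99.8 | 99.6 | 72.8 | 95.2 |
| SIRA (Llama-3.1-70B) | 98.0 | 87.6 | 85.0 | 99.2 | 99.6 | 60.6 | 88.6 |
| SIRA (GPT-4o-mini) | 98.0 | 85.2 | 84.8 | 97.2 | 99.6 | 57.6 | 94.8 |
| BIRA (Llama-3.1-8B, ours) | 99.8 | 99.4 | 99.8 | 100.0 | 100.0 | 99.6 | 99.8 |
| BIRA (Llama-3.1-70B, ours) | 99.4 | 99.0 | 99.6 | 99.8 | 99.6 | 98.8 | 98.0 |
| BIRA (GPT-4o-mini, ours) | 99.4 | 100.0 | 100.0 | 99.8 | 99.8 | 99.4 | 98.2 |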

Table 1. Attack success rate (ASR, %) of each attack against seven watermarks. Rows marked "ours" are BIRA. Gains are largest against SIR, the watermark most robust to the baseline attacks.
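The ASR figures in Table 1 follow directly from the definition given above: the share of attacked watermarked texts that the detector no longer flags. A minimal sketch of that computation, with a hypothetical list of detector decisions:

```python
from typing import Sequence


def attack_success_rate(detected_after_attack: Sequence[bool]) -> float:
    """Percentage of attacked watermarked texts that evade detection."""
    if not detected_after_attack:
        raise ValueError("need at least one attacked text")
    evaded = sum(1 for detected in detected_after_attack if not detected)
    return 100.0 * evaded / len(detected_after_attack)


# Hypothetical detector decisions: 3 of 500 attacked texts are still flagged.
decisions = [True] * 3 + [False] * 497
print(f"ASR = {attack_success_rate(decisions):.1f}%")
```

The sample size of 500 is an assumption for the example, not a figure stated in the source.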

Detection & Text Quality

Lowest detectability with semantics preserved

Detection performance: best F1 and TPR at calibrated FPRs.
Figure 2. Detector performance with calibrated thresholds: best F1 score and TPR at fixed FPR (1% and 10%). BIRA (red) gives the lowest values across all schemes — hardest to tell from human text. Lower is better.
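TPR at a fixed FPR is typically computed by calibrating the detection threshold on unwatermarked text and then counting how many attacked texts still exceed it. The sketch below shows one such calibration. The score lists are invented, and the exact calibration procedure used for Figure 2 is not given in the source, so treat this as an assumed but common recipe.

```python
import math
from typing import Sequence


def threshold_at_fpr(negative_scores: Sequence[float], fpr: float) -> float:
    """Threshold such that at most a fraction `fpr` of negatives score above it."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    if not 0.0 <= fpr < 1.0:
        raise ValueError("fpr must be in [0, 1)")
    ranked = sorted(negative_scores, reverse=True)
    allowed = math.floor(fpr * len(ranked))  # false positives we may tolerate
    return ranked[allowed]


def tpr_at_fpr(positive_scores: Sequence[float],
               negative_scores: Sequence[float],
               fpr: float) -> float:
    """Fraction of positives flagged (score > threshold) at the calibrated threshold."""
    if not positive_scores:
        raise ValueError("need at least one positive score")
    threshold = threshold_at_fpr(negative_scores, fpr)
    flagged = sum(1 for s in positive_scores if s > threshold)
    return flagged / len(positive_scores)


# Hypothetical detector scores: human-written text vs. attacked watermarked text.
human_scores = [float(i) for i in range(10)]
attacked_scores = [5.0, 8.5, 9.5, 12.0]
print(tpr_at_fpr(attacked_scores, human_scores, fpr=0.1))
```

From the attacker's side, a lower TPR at the same FPR means more attacked texts pass as unwatermarked.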
Text quality: LLM judgment, Self-BLEU, perplexity.
Figure 3. BIRA scores higher on LLM-judged fidelity than DIPPER/SIRA, while its low Self-BLEU shows more aggressive paraphrasing — strong evasion while preserving semantics.

Qualitative Example

BIRA effectively removes the green tokens


Watermarked Text by KGW (z-score: 10.33)
Was it a surprise to you that you were given the arts and culture position? No, there is no surprise when you are a cadre. And it was a great honor. I enjoy serving my country in that capacity. It's what I really love. If anything, it was good to hear you say no once more; given that it was the first official rejection in your official career. Did anything take priority for you that the position didn't? Yes. One of those in-laws I wanted to stay with a little too much priority. But I'm glad to know you asked. I certainly wasn't giving up on helping my country in any capacity, that's not me at all. I just didn't want to go at it one more time. Once you go you will know how to say no. I had already been told in 2008 to hold off on being an officer until I got my undergraduate degree, that was one such instance. There's your answer. Ha, that is great. And good luck, I'm sure you'll do something great.
Attacked Text by BIRA (z-score: 2.60)
Was receiving the arts and culture position unexpected for you? Not at all, since I am part of the cadres. It was actually a huge honor. I truly love serving my country in this role. Given that this was your first formal rejection, I was glad to see you decline. Was there something specific that took precedence over this opportunity? One thing did - my desire to spend time with my in-laws took higher priority. I appreciate you asking. Please know I am still committed to contributing to my country's efforts. I simply chose not to pursue this particular path. Saying "no" becomes easier with time. For example, I was advised back in 2008 to delay becoming an officer until I finished my undergrad studies. That's excellent. I wish you all the best and am confident you'll achieve great things.

Figure 4. KGW-watermarked text and its BIRA rewrite (Llama-3.1-8B). BIRA suppresses green tokens while preserving meaning, reducing the $z$-score from 10.33 to 2.60.
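The $z$-scores in Figure 4 are reported without restating the statistic. The sketch below assumes the standard one-proportion $z$-test used by KGW-style detectors, where `green_fraction` is the share of the vocabulary placed on the green list. The parameter names and all numeric values are illustrative assumptions, and the second function shows how a $z$-score threshold corresponds to a green-token rate threshold under that assumption.

```python
import math


def green_z_score(green_count: float, n_tokens: int, green_fraction: float) -> float:
    """z = (observed green count - expected count) / standard deviation."""
    if n_tokens <= 0 or not 0.0 < green_fraction < 1.0:
        raise ValueError("need n_tokens > 0 and 0 < green_fraction < 1")
    expected = green_fraction * n_tokens
    std = math.sqrt(n_tokens * green_fraction * (1.0 - green_fraction))
    return (green_count - expected) / std


def rate_threshold(z_threshold: float, n_tokens: int, green_fraction: float) -> float:
    """Green-token rate at which the z-score equals z_threshold."""
    if n_tokens <= 0 or not 0.0 < green_fraction < 1.0:
        raise ValueError("need n_tokens > 0 and 0 < green_fraction < 1")
    spread = math.sqrt(green_fraction * (1.0 - green_fraction) / n_tokens)
    return green_fraction + z_threshold * spread


# Hypothetical counts: 120 green tokens out of 200 with half the vocabulary green.
print(f"z = {green_z_score(120, 200, 0.5):.2f}")
print(f"rate threshold at z = 4: {rate_threshold(4.0, 200, 0.5):.3f}")
```

Under this reading, lowering the green-token rate below the computed threshold is exactly what pushes the $z$-score under the detection cutoff.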

Analysis

Proxy suppression set $\widehat{\mathcal{G}}$ effectively targets watermark traces

Per-sample detection upper bounds for BIRA vs Vanilla.
Figure 5. Per-sample detection upper bounds, $\exp(-\tfrac{1}{2}N\delta^2)$, given by the key result. BIRA pushes them far below Vanilla (90th percentile: $7.5\times10^{-2}$ vs. $8.0\times10^{-1}$).
ASR for self-information-guided vs random token selection.
Figure 6. Targeting high-surprisal tokens beats random selection at every ratio — self-information reliably locates watermark traces.

Impact Statement

Red-teaming toward more robust watermarks

This work stress-tests the robustness of current LLM watermarks, revealing that they are not yet sufficiently hardened against sophisticated black-box evasion. While a successful evasion method carries misuse risk, we believe transparently exposing such vulnerabilities is a prerequisite for progress — serving as adversarial red-teaming that motivates more robust defenses and truly responsible AI.

BibTeX

@article{hwang2025llm,
  title   = {LLM Watermark Evasion via Bias Inversion},
  author  = {Hwang, Jeongyeon and Park, Sangdon and Ok, Jungseul},
  journal = {arXiv preprint arXiv:2509.23019},
  year    = {2025}
}