Watermark Resilience Analysis | Oversight Protocol

Zion Boggan · April 2026 · Oversight Protocol v0.4.4 (measurement snapshot), documentation current as of v0.4.5

Oversight embeds three independent watermark layers in each recipient's copy of a sealed document. Each layer targets a different class of attack, and each has known failure modes. This page presents a frank analysis of what the watermark stack can and cannot survive, grounded in the academic literature on text watermarking and informed by the anti-stripping research conducted during protocol development.

Attack Taxonomy

The adversary receives a watermarked document and wants to leak it without attribution. The following table categorizes the principal attack classes by effort required and which watermark layers they defeat.

Attack Class	Effort	Defeats L1 (ZW)	Defeats L2 (WS)	Defeats L3 (Semantic)
Strip invisible characters	Trivial (one regex)	Yes	No	No
Strip trailing whitespace	Trivial (one regex)	No	Yes	No
Both normalizations	Trivial	Yes	Yes	No
Manual paraphrase	High (hours of work)	Yes	Yes	Yes
LLM paraphrase	Low (~$0.88/M tokens)	Yes	Yes	Yes
Screenshot + OCR	Medium	Yes	Yes	No
Screenshot + retype	High	Yes	Yes	Yes
Format conversion (DOCX to TXT)	Trivial	Usually yes	Partial	No

The fundamental challenge is straightforward: any mark embedded in formatting or invisible characters dies on normalization; any mark embedded in word choice dies on paraphrase. The only signals that survive total rewrite are structural and semantic properties of the content itself, and even those degrade under aggressive modification.

Per-Layer Resilience

L1: Zero-Width Unicode

L1 inserts frames of zero-width characters (ZWSP U+200B, ZWNJ U+200C, ZWJ U+200D) at regular intervals throughout the text. Each frame encodes the full 64-bit mark_id. Multiple redundant copies are scattered at configurable density (default: one frame per 40 visible characters).

L1 survives copy-paste in most applications and some format conversions. It is defeated by a single regex: [\u200b\u200c\u200d\ufeff] applied globally. Any text-normalization library or linter that strips invisible characters destroys L1 completely. Recovery from L1 is binary: either the frames are present (and extraction is trivial) or they are gone. There is no partial recovery.
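The stripping attack is short enough to show in full. The following is a minimal Python sketch using the character class quoted above; the function name and the sample frame are illustrative assumptions, not part of Oversight's API.

```python
import re

# Character class from the text: ZWSP, ZWNJ, ZWJ, plus U+FEFF (zero-width no-break space).
INVISIBLES = re.compile("[\u200b\u200c\u200d\ufeff]")

def strip_invisibles(text: str) -> str:
    """Remove every zero-width character, which destroys any L1 frames."""
    return INVISIBLES.sub("", text)

# Hypothetical marked string: a visible sentence with a made-up four-character frame.
marked = "The quarterly\u200b\u200c\u200d\u200b report is attached."
clean = strip_invisibles(marked)
```

After this pass the visible text is unchanged and no frame characters remain, which is why recovery from L1 is all-or-nothing.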

L1 exists not because it is robust, but because it is cheap to embed, cheap to extract, and catches the lazy adversary who copies the text without thinking about watermarks.

L2: Trailing Whitespace

L2 encodes mark_id bits as trailing space (bit 0) vs trailing tab (bit 1) on lines that have no existing trailing whitespace. A 64-bit mark_id requires 64 eligible lines. Partial recovery is supported: if only 40 of 64 lines survive, the recovered bits cover 62.5% of the mark_id, yielding a partial candidate that may still contribute to multi-layer fusion.
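The encoding and the 40-of-64 partial-recovery case can be sketched as follows. This is an illustration under stated assumptions, not the protocol's implementation: every input line is assumed non-empty and free of trailing whitespace (so every line is eligible), and the names embed_l2, extract_l2, and coverage are hypothetical.

```python
from typing import List, Optional

def embed_l2(lines: List[str], bits: List[int]) -> List[str]:
    """Append a space (bit 0) or tab (bit 1) to each line, one bit per line.

    Assumes every input line is non-empty and has no trailing whitespace.
    """
    if len(lines) < len(bits):
        raise ValueError("not enough eligible lines for the payload")
    marked = [line + ("\t" if bit else " ") for line, bit in zip(lines, bits)]
    return marked + lines[len(bits):]

def extract_l2(lines: List[str], n_bits: int = 64) -> List[Optional[int]]:
    """Read back up to n_bits; None marks a line whose trailing mark was stripped."""
    out: List[Optional[int]] = []
    for line in lines[:n_bits]:
        if line.endswith("\t"):
            out.append(1)
        elif line.endswith(" "):
            out.append(0)
        else:
            out.append(None)
    return out

def coverage(recovered: List[Optional[int]]) -> float:
    """Fraction of payload positions that are still readable."""
    return sum(b is not None for b in recovered) / len(recovered)

lines = [f"Paragraph line {i}" for i in range(64)]
bits = [(i * 5 + 1) % 2 for i in range(64)]
marked = embed_l2(lines, bits)
# Simulate a cleaner that strips trailing whitespace from the last 24 lines.
damaged = marked[:40] + [line.rstrip() for line in marked[40:]]
recovered = extract_l2(damaged)
```

In this run 40 of the 64 positions remain readable, matching the 62.5% figure; the stripped positions come back as None rather than as guessed bits.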

L2 survives more aggressive cleaning than L1 because many text editors and linters do not strip trailing whitespace from content fields (as opposed to code). However, any tool that normalizes line endings with trailing-space removal destroys L2. Format conversion (DOCX to TXT, PDF to TXT) usually strips trailing whitespace.

L3: Semantic Marks

L3 is Oversight's primary resilience layer. It encodes attribution signals in the choice of words and punctuation, not in formatting metadata. L3 comprises five sublayers operating independently:

T1 (synonym rotation): For each word belonging to one of 151 synonym classes (e.g., {begin, start, commence}), the mark_id determines which variant is selected via a deterministic SHA-256-based expansion. The attacker cannot distinguish a watermarked synonym choice from a natural one without access to the original text. T1 survives format conversion, invisible-character stripping, OCR, and faithful retyping. It is defeated by paraphrasing that changes more than 30% of synonym-class words.
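One way to picture the SHA-256-based expansion is the sketch below. The exact construction is not specified on this page, so the hash input layout (mark_id followed by a class index), the modulo reduction, and the two-class table are all assumptions made for illustration.

```python
import hashlib
from typing import List

# Tiny illustrative table; the text describes 151 classes in the real system.
SYNONYM_CLASSES = [
    ("begin", "start", "commence"),
    ("big", "large", "sizable"),
]

def variant_index(mark_id: int, class_idx: int, n_variants: int) -> int:
    """Derive a per-class choice from SHA-256(mark_id || class_idx) (assumed layout)."""
    data = mark_id.to_bytes(8, "big") + class_idx.to_bytes(2, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:4], "big") % n_variants

def apply_t1(words: List[str], mark_id: int) -> List[str]:
    """Replace each synonym-class word with the variant selected for this mark_id."""
    lookup = {w: (i, cls) for i, cls in enumerate(SYNONYM_CLASSES) for w in cls}
    out = []
    for w in words:
        if w in lookup:
            i, cls = lookup[w]
            out.append(cls[variant_index(mark_id, i, len(cls))])
        else:
            out.append(w)
    return out

words = "we will begin the big review".split()
marked_words = apply_t1(words, mark_id=0x1234ABCD)
```

Because the choice depends only on mark_id and the class, re-applying the function to already-marked text leaves it unchanged, and words outside any class are never touched.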

T2 (punctuation fingerprint): Three deterministic choices per document: Oxford comma presence, em dash vs double-hyphen, and curly vs straight quotes. Yields 3 bits per document. Survives almost everything including OCR.

T2b (spelling variants): 25 British/American spelling pairs (color/colour, organize/organise, etc.) keyed to mark_id bits. Yields up to 25 bits for documents containing these words. Survives format conversion and OCR.

T2c (contractions): 30 contraction/expansion pairs (don't/do not, it's/it is, etc.) keyed to mark_id bits. Survives everything except deliberate re-contracting or re-expanding.

T2d (number formatting): Comma separators in large numbers (1,000 vs 1000) and percent symbol vs word form (50% vs 50 percent). Yields 2 bits.

The combined L3 capacity is 30 to 80 bits per page of normal prose, depending on vocabulary composition. Verification uses a weighted scoring system (synonyms 50%, spelling 20%, contractions 20%, punctuation 10%) with an overall match threshold of 0.65.
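The weighted score reduces to a few lines of arithmetic. The sketch below assumes each sublayer reports a match fraction between 0 and 1 and that the overall score is the plain weighted sum; the page does not say how T2d or missing sublayers are handled, so neither is modeled here. The function names are hypothetical.

```python
from typing import Dict

# Weights and threshold as stated in the text.
WEIGHTS = {"synonyms": 0.5, "spelling": 0.2, "contractions": 0.2, "punctuation": 0.1}
MATCH_THRESHOLD = 0.65

def l3_score(fractions: Dict[str, float]) -> float:
    """Weighted sum of per-sublayer match fractions (each assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)

def l3_matches(fractions: Dict[str, float]) -> bool:
    """True when the weighted score reaches the overall match threshold."""
    return l3_score(fractions) >= MATCH_THRESHOLD

strong = {"synonyms": 0.8, "spelling": 0.5, "contractions": 0.5, "punctuation": 1.0}
weak = {"synonyms": 0.6, "spelling": 0.5, "contractions": 0.5, "punctuation": 1.0}
```

The two examples differ only in the synonym fraction (0.8 vs 0.6), which is enough to move the score across the threshold because synonyms carry half the weight.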

Cross-Format Survival Summary

Layer	DOCX to TXT	PDF to TXT	Copy-Paste	Email Forward
L1 (zero-width)	Maybe	Usually no	Maybe	Usually no
L2 (whitespace)	Usually no	No	No	No
L3 (synonyms)	Yes	Yes	Yes	Yes
L3 (punctuation)	Yes	Yes	Yes	Yes (mostly)

Error-Correcting Code Protection

The ecc module wraps L3 synonym bits in a repetition code with majority-vote decoding. Each payload bit is repeated R times (default R=7), and decoding recovers the original by majority vote over each group.

With R=7, up to 3 errors per group are corrected, so each group tolerates 3 of its 7 bits (about 43%) being flipped. Under random errors the tolerable average rate is lower, because some groups will exceed three flips by chance. For a 64-bit mark_id encoded with R=7, the coded signal requires 448 synonym-class instances (64 * 7). With approximately 150 synonym-class hits per page of prose, three pages provide full coverage.
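A repetition code with majority-vote decoding fits in a few lines. The sketch below is a generic implementation of that technique, not the ecc module itself; the function names are hypothetical.

```python
from typing import List

def rep_encode(bits: List[int], r: int = 7) -> List[int]:
    """Repeat each payload bit r times."""
    return [b for b in bits for _ in range(r)]

def rep_decode(coded: List[int], r: int = 7) -> List[int]:
    """Majority vote over each group of r coded bits (r is assumed odd)."""
    groups = [coded[i:i + r] for i in range(0, len(coded), r)]
    return [1 if sum(g) > len(g) // 2 else 0 for g in groups]

payload = [i % 2 for i in range(64)]
coded = rep_encode(payload)          # 64 * 7 = 448 coded bits

# Flip 3 bits in every group: still within the correction limit.
noisy = coded[:]
for g in range(64):
    for j in range(3):
        noisy[g * 7 + j] ^= 1
decoded_ok = rep_decode(noisy)

# A fourth flip in group 0 pushes that group past the limit.
noisy[3] ^= 1
decoded_bad = rep_decode(noisy)
```

With three flips per group the payload decodes exactly; the fourth flip in group 0 corrupts only that one payload bit, which illustrates that the tolerance is per group rather than global.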

The ECC layer transforms L3 from a fragile threshold-based system (where the legacy verify_synonyms_match() uses a 70% match cutoff) into one with mathematically bounded error tolerance. If the attacker flips at most floor(R/2) bits per group (3 for R=7), the payload is recovered with certainty. If they flip more, detection fails gracefully rather than producing false positives.

The approach is simpler than real BCH or Reed-Solomon codes (no Galois field arithmetic is required), but it achieves the practical goal. A future version may replace the repetition code with BCH(63,16,11), which would encode 16 data bits into 63 coded bits tolerating up to 11 errors (17.5% error rate) with better bandwidth efficiency than R=7 repetition.

Content Fingerprinting: The Server-Side Fallback

Content fingerprinting is the defense of last resort when all embedded watermarks have been destroyed. The fingerprint database is not a watermark: the fingerprints are computed at seal time and stored server-side (in the registry or alongside the sealed file), never embedded in the document. Because the adversary cannot strip what is not present in the document, this layer survives the VM-export attack (airgapped VM, strip everything, export clean file).

Oversight computes two independent fingerprint types.

Winnowing (Schleimer, Wilkerson, Aiken; SIGMOD 2003) computes rolling hash fingerprints over character k-grams of the normalized text and selects a subset via the winnowing algorithm (minimum hash in each window of size W). Comparison uses Jaccard similarity over the selected hash sets. Winnowing detects near-verbatim partial copies with high precision. It does not survive paraphrasing, because k-gram hashes change when words change.
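The winnowing step can be sketched as below. Several details are assumptions for illustration: the normalization (lowercase, alphanumerics only), the parameters k=5 and w=4, and the use of truncated SHA-256 in place of a true rolling hash, which is slower but keeps the example short.

```python
import hashlib
from typing import List, Set

def normalize(text: str) -> str:
    """Assumed normalization: lowercase and keep only letters and digits."""
    return "".join(ch.lower() for ch in text if ch.isalnum())

def kgram_hashes(text: str, k: int) -> List[int]:
    """Hash every character k-gram (truncated SHA-256 stands in for a rolling hash)."""
    s = normalize(text)
    return [
        int.from_bytes(hashlib.sha256(s[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(s) - k + 1)
    ]

def winnow(text: str, k: int = 5, w: int = 4) -> Set[int]:
    """Keep the minimum hash from each sliding window of w consecutive k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two fingerprint sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

original = "The merger will close in the third quarter."
reflowed = "THE MERGER  will close in the\nthird quarter"
unrelated = "Zebras graze quietly beyond purple hills."
```

Case and whitespace changes disappear in normalization, so the reflowed copy scores the same as the original, while text with no shared k-grams scores near zero.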

Semantic sentence hashing computes SHA-256 hashes (truncated to 16 hex characters) over the sorted content words of each sentence. Sorting provides order-independence within each sentence, so minor word reordering does not change the hash. Comparison uses set-overlap (fraction of hashes from the leaked text that appear in the stored fingerprint). This method survives minor edits and format conversion but not heavy paraphrasing.
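A minimal sketch of this scheme follows. The sentence splitter, the tokenizer, and the stopword list used to pick out content words are assumptions; only the sorting, the SHA-256 truncation to 16 hex characters, and the set-overlap comparison come from the description above.

```python
import hashlib
import re
from typing import Set

# Assumed stopword list; the real definition of "content words" is not given here.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "is", "are", "was", "were"}

def sentence_hashes(text: str) -> Set[str]:
    """Hash the sorted content words of each sentence, truncated to 16 hex characters."""
    hashes = set()
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(
            w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS
        )
        if words:
            digest = hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()
            hashes.add(digest[:16])
    return hashes

def overlap(leaked: Set[str], stored: Set[str]) -> float:
    """Fraction of the leaked text's hashes that appear in the stored fingerprint."""
    if not leaked:
        return 0.0
    return len(leaked & stored) / len(leaked)

stored = sentence_hashes("The merger closes in March. Staff were told on Friday.")
reordered = sentence_hashes("In March the merger closes. On Friday staff were told.")
edited = sentence_hashes("The merger closes in April. Staff were told on Friday.")
```

Reordering words inside a sentence leaves both hashes intact, while changing a single content word ("March" to "April") invalidates that sentence's hash and halves the overlap in this two-sentence example.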

The combined score is 0.4 * winnowing + 0.6 * sentence, reflecting the greater robustness of sentence-level hashing. A combined score of 0.6 or higher produces a MATCH verdict; 0.3 or higher produces LIKELY.
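As a worked example of the scoring rule, the sketch below applies the stated weights and thresholds. The label for scores under 0.3 is not given in the text, so "NO_MATCH" is a placeholder, and the function names are hypothetical.

```python
def combined_score(winnowing: float, sentence: float) -> float:
    """Weighted combination stated in the text: 0.4 * winnowing + 0.6 * sentence."""
    return 0.4 * winnowing + 0.6 * sentence

def verdict(score: float) -> str:
    """Map a combined score to a verdict; 'NO_MATCH' is a placeholder label."""
    if score >= 0.6:
        return "MATCH"
    if score >= 0.3:
        return "LIKELY"
    return "NO_MATCH"

near_verbatim = verdict(combined_score(0.9, 0.5))   # 0.36 + 0.30 = 0.66
light_edit = verdict(combined_score(0.0, 0.6))      # 0.00 + 0.36 = 0.36
heavy_rewrite = verdict(combined_score(0.1, 0.1))   # 0.04 + 0.06 = 0.10
```

The middle case shows the effect of the weighting: a leak with no surviving k-gram overlap can still reach LIKELY on sentence hashes alone.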

Because each recipient's copy has slightly different synonym choices (from L3), the fingerprints of different recipients' copies are measurably different. The leaked text's fingerprint is compared against all stored per-recipient fingerprints, and the closest match identifies the source copy.

Information-Theoretic Capacity Bounds

The fundamental limit for post-hoc text watermarking can be analyzed through the lens of the binary symmetric channel. Each synonym-class instance acts as a discrete channel with capacity log2(K) bits where K is the number of variants (typically 3, giving 1.58 bits per instance). An adversary who paraphrases independently flips each bit with probability p.

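For a binary symmetric channel with crossover probability p, the capacity is 1 - H(p) bits per channel use, where H is the binary entropy function, H(p) = -p log2(p) - (1-p) log2(1-p). The table below applies this binary model, treating each synonym-class instance as one channel use. The practical implications, assuming 150 synonym-class instances per page of prose: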

Error Rate (p)	Scenario	Bits/Instance	Usable Bits/Page	Pages for 64-bit mark_id
0.10	Light paraphrasing	0.53	~80	~1
0.20	Moderate paraphrasing	0.28	~42	~2
0.30	LLM paraphrase (moderate)	0.12	~18	~4
0.50	Aggressive LLM rewrite	0.00	0	Impossible

At p=0.5, the channel capacity drops to zero. No error-correcting code, no matter how sophisticated, can recover the payload. At this point the watermark is information-theoretically destroyed, and the fingerprint database is the only remaining attribution mechanism.
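The table values follow directly from the capacity formula and can be reproduced with the short sketch below. The 150 instances per page and the 64-bit payload are the figures used in the text; the function names are hypothetical. These are Shannon limits, so a practical code such as the R=7 repetition code will need more text than the page counts shown.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

def pages_needed(p: float, instances_per_page: int = 150, payload_bits: int = 64) -> float:
    """Minimum pages to carry the payload at capacity; infinite when capacity is zero."""
    usable = bsc_capacity(p) * instances_per_page
    if usable <= 0:
        return math.inf
    return payload_bits / usable

rows = [(p, round(bsc_capacity(p), 2), pages_needed(p)) for p in (0.10, 0.20, 0.30, 0.50)]
```

Rounding the page counts up to whole pages gives the last column of the table, and the p=0.5 row comes out as infinity rather than a large finite number.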

The No Free Lunch Theorem

Pang, Hu, et al. (NeurIPS 2024) proved a fundamental trade-off in watermarking design, identifying three properties that cannot all be maximized simultaneously:

Robustness: the watermark survives text modifications.

Anti-spoofing: adversaries cannot forge the watermark onto arbitrary text to frame an innocent party.

Easy detection: detection does not require storing all generated text server-side.

Robust watermarks are inherently vulnerable to spoofing attacks. Publicly detectable watermarks (using cryptographic signatures) resist spoofing but sacrifice robustness. No design can achieve all three properties at their theoretical maximum.

Oversight's threat model prioritizes robustness (leak attribution) over anti-spoofing, and mitigates the spoofing risk through the existing manifest signature and Rekor transparency log. The signed manifest binds each mark_id to a specific recipient, issuer, and timestamp. An attacker who forges a watermark into a document cannot produce a valid manifest signature from the original issuer, so the forged mark would not resolve in the registry. This is an imperfect defense (it requires the registry to be consulted), but it shifts the No Free Lunch balance toward robustness without leaving spoofing completely unaddressed.

Best Achievable Resilience

Based on the academic literature survey (19 papers, 2001 to 2026) and empirical testing, the following table represents realistic expected detection rates for post-hoc text watermarking as of 2026.

Attack	Best Defense	Expected Detection Rate
Normalization (strip invisibles)	Semantic marks (L3)	100%
Format conversion	Semantic marks (L3)	100%
Manual paraphrase (light)	Synonym rotation + ECC	80-90%
Manual paraphrase (heavy)	Fingerprint database	60-70%
LLM paraphrase (SIRA-class)	ECC + redundancy + fingerprint DB	40-60%
Screenshot + faithful retype	Synonym marks	90%+
Screenshot + paraphrase retype	Fingerprint DB only	30-50%
Full hostile rewrite	Nothing at watermark level	Use canary traps / unique facts

The LLM-Stripping Threat

LLM-based paraphrasing is the most dangerous attack vector because it is cheap (approximately $0.88 per million tokens) and effective. The SIRA attack (Self-Information Rewrite Attack; Cheng et al., ICML 2025) achieves nearly 100% success against seven recent watermarking methods by targeting high-entropy tokens, which are precisely where watermark information concentrates. Watermarks must place signal in high-entropy positions because low-entropy tokens are too predictable to carry information. SIRA exploits this structural weakness.

Oversight's defense-in-depth strategy against LLM stripping relies on three observations. First, L3's synonym rotation distributes signal across many low-salience word choices, making the per-token information density low enough that SIRA-style targeted attacks must modify a large fraction of the text to succeed. Second, the ECC layer tolerates moderate bit errors from partial paraphrasing. Third, the fingerprint database provides attribution even when the watermark is fully destroyed, as long as the leaked text retains enough structural similarity to the original.

Against aggressive LLM rewriting where the attacker rewrites every sentence, no embedded watermark survives (the Shannon capacity at p=0.5 is zero). The fingerprint database's effectiveness also degrades because semantic sentence hashes change under full rewriting. At this extreme, the remaining defense is canary content (unique facts per recipient), which requires author cooperation and is not automatable.

Honest Limitations

The following limitations are inherent to text watermarking as a field, not specific deficiencies in Oversight's implementation. Any system claiming to solve these problems should be viewed with skepticism.

No text watermark survives a determined adversary with LLM access. If the attacker is willing to fully rewrite the content in their own words (or pay an LLM to do so), all embedded marks are destroyed. The best achievable goal is making stripping expensive, detectable, and probabilistically attributable, not making it impossible.

Semantic watermarks cannot defend against paraphrasing beyond ~50% word substitution. The information-theoretic channel capacity drops to zero at the 50% crossover probability. This is a mathematical limit, not an engineering one.

Content fingerprinting degrades with rewriting distance. The fingerprint database is a powerful fallback for near-verbatim and lightly-modified leaks. For heavily rewritten text, the hash-based winnowing and sentence scores drop below the attribution threshold, and embedding-based comparison weakens as well: modern sentence transformers (2024-2025) achieve greater than 0.85 cosine similarity for paraphrases, but this drops below 0.7 for semantically different content.

False positive risk is nonzero. With 151 three-variant synonym classes and a 70% match threshold across 40 instances, the random match probability is approximately 10^-6. With ECC, the false positive rate drops below 10^-9 for 40+ valid bits. These are acceptable rates for forensic use, but they are not zero. Watermark evidence should be corroborated with other signals (beacon callbacks, access logs, registry records) before taking action against a specific recipient.
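The order of magnitude of the first figure can be checked with a binomial tail. The sketch below assumes that an unrelated document matches each three-variant instance independently with probability 1/3, and that "with ECC" corresponds to agreeing by chance on every decoded bit; both are modeling assumptions, not statements about the implementation.

```python
from fractions import Fraction
from math import comb

def random_match_probability(n: int = 40, min_matches: int = 28, variants: int = 3) -> float:
    """P(at least min_matches of n instances match by chance), each with probability 1/variants."""
    p = Fraction(1, variants)
    tail = sum(
        comb(n, j) * p**j * (1 - p) ** (n - j)
        for j in range(min_matches, n + 1)
    )
    return float(tail)

def exact_bits_probability(valid_bits: int = 40) -> float:
    """Chance that valid_bits independent fair bits all agree with a given mark_id."""
    return 2.0 ** -valid_bits

# 70% of 40 instances is 28 matches.
threshold_fp = random_match_probability()
ecc_fp = exact_bits_probability()
```

Under these assumptions the threshold test lands at a few times 10^-6 and the 40-bit agreement falls well below 10^-9, consistent with the figures quoted above.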

Airgapped readers leave no network beacon. DNS and HTTP beacons fire only when the document is opened in a network-connected environment. An adversary who reads the document in an airgapped VM and leaks a summary (not the document itself) produces no watermark evidence and no beacon event. The only defense at that point is operational security: controlling who receives the document in the first place.

Oversight's defense-in-depth is the right architecture. Multiple independent layers, each targeting different attack classes, provide strictly better coverage than any single layer. No component of the current stack (L1 + L2 + L3 + fingerprint DB + beacons + registry) is individually unbreakable, but the combined coverage is broad enough that an adversary must invest significant effort to defeat all layers simultaneously.

This analysis reflects the state of text watermarking research as of April 2026. The anti-stripping research document with 19 cited papers is available in the repository (ANTI_STRIPPING_RESEARCH.md).