Zion Boggan · April 2026 · Oversight Protocol v0.4.4

Abstract

The Oversight protocol embeds three independent watermark layers into sealed documents for post-hoc leak attribution. Each layer targets a different class of removal attack and degrades at a different rate under adversarial modification. This paper presents an empirical evaluation of watermark survival across eight attack classes (normalization, format conversion, copy-paste, OCR, manual paraphrase, LLM-based rewriting, and two compound scenarios), reports per-layer detection scores from controlled benchmarks, and analyzes the theoretical capacity limits of the synonym-based embedding channel using Shannon's binary symmetric channel model. A repetition-coded error-correcting layer (R=7) tolerates up to 3 flipped bits in each 7-bit group (roughly 42% of coded bits) while still recovering the original 64-bit mark identifier. Content fingerprinting via winnowing and sentence hashing is evaluated as a server-side fallback, with measured results showing 100% attribution after complete L1 and L2 stripping. The paper identifies LLM-based paraphrasing as the principal unsolved threat, gives a candid assessment of current defenses against the SIRA attack, and derives information-theoretic bounds on the bits per page recoverable at various paraphrase rates. A survey of 19 papers from 2001 to 2026 situates these results within the broader watermarking literature.

1. Introduction

When a confidential document surfaces in an unauthorized channel, the first forensic question is attribution: whose copy was leaked? Watermarking systems attempt to answer this question by embedding recipient-specific signals into each distributed copy before delivery. The signals must be imperceptible to honest readers, recoverable from the leaked text by an authorized verifier, and resistant to deliberate removal by a motivated adversary.

The difficulty of this problem scales with the adversary's willingness to modify the document. A casual leaker who forwards the file unchanged preserves all marks. A moderately sophisticated adversary who normalizes invisible characters and trailing whitespace destroys formatting-level marks but leaves semantic content intact. A determined adversary with access to large language models can rewrite the document sentence by sentence at negligible cost, destroying all embedded signals. No watermarking system, regardless of its sophistication, can survive the last scenario, a fact that follows directly from Shannon's channel capacity theorem when the crossover probability reaches 0.5.

Oversight addresses this graduated threat landscape through defense in depth: three independent watermark layers, each targeting a different modification class, combined with server-side content fingerprinting that provides attribution even after all embedded marks are destroyed. The protocol does not claim to defeat all adversaries. It claims to raise the cost of unattributed leaking, to provide probabilistic attribution when the adversary's rewriting is imperfect, and to degrade gracefully as attack intensity increases.

This paper presents the first systematic evaluation of the Oversight watermark stack, covering per-layer survival rates, ECC error tolerance, content fingerprint resilience, information-theoretic capacity bounds, and an honest assessment of the LLM-based stripping threat. The evaluation uses controlled experiments on the v0.4.4 reference implementation, benchmarked on an Intel Core i7 under CPython 3.14.2. All measurements are means of 10 runs.

2. Experimental Setup

2.1 The Three Watermark Layers

Oversight applies three layers sequentially to the plaintext before encryption. Layer 1 (L1) encodes the 64-bit mark_id using zero-width Unicode characters (U+200B, U+200C, U+200D) inserted at regular intervals throughout the text. One complete frame of approximately 66 zero-width characters encodes the full mark_id, and frames are repeated every 40 visible characters for redundancy. L1 is invisible to human readers and survives copy-paste in most applications, but is trivially defeated by a single regex that strips non-printable Unicode.
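The L1 framing can be illustrated with a minimal sketch. The text does not say which zero-width character plays which role, so the mapping below (U+200B for bit 0, U+200C for bit 1, U+200D as frame delimiter) and the 66-character layout (delimiter, 64 bit symbols, delimiter) are assumptions chosen to match the approximate frame length given above. All function names are hypothetical.

```python
from typing import Optional

# Assumed symbol roles; the reference implementation may assign them differently.
ZW_ZERO = "\u200b"   # zero-width space      -> bit 0 (assumption)
ZW_ONE = "\u200c"    # zero-width non-joiner -> bit 1 (assumption)
ZW_DELIM = "\u200d"  # zero-width joiner     -> frame delimiter (assumption)
FRAME_INTERVAL = 40  # visible characters between frames, per the text


def build_frame(mark_id: int) -> str:
    """Encode a 64-bit mark_id as delimiter + 64 bit symbols + delimiter."""
    bits = format(mark_id & (2**64 - 1), "064b")
    body = "".join(ZW_ONE if b == "1" else ZW_ZERO for b in bits)
    return ZW_DELIM + body + ZW_DELIM


def embed_l1(text: str, mark_id: int) -> str:
    """Append one frame after each chunk of up to FRAME_INTERVAL visible characters."""
    frame = build_frame(mark_id)
    out = []
    for start in range(0, len(text), FRAME_INTERVAL):
        out.append(text[start:start + FRAME_INTERVAL])
        out.append(frame)
    return "".join(out)


def extract_l1(text: str) -> Optional[int]:
    """Return the mark_id from the first intact frame, or None if no frame survives."""
    start = text.find(ZW_DELIM)
    while start != -1:
        body = text[start + 1:start + 65]
        closed = text[start + 65:start + 66] == ZW_DELIM
        if len(body) == 64 and closed and set(body) <= {ZW_ZERO, ZW_ONE}:
            return int("".join("1" if c == ZW_ONE else "0" for c in body), 2)
        start = text.find(ZW_DELIM, start + 1)
    return None
```

Because every frame carries the full identifier, one surviving frame is enough for recovery in this sketch, which is consistent with the all-or-nothing behavior expected of a framed encoding.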

Layer 2 (L2) encodes mark_id bits as trailing whitespace patterns: a trailing space represents bit 0, a trailing tab represents bit 1. Each eligible line (one without existing trailing whitespace) carries one bit, so a 64-bit mark_id requires 64 eligible lines. L2 is destroyed by any tool that normalizes trailing whitespace on save, a behavior common in code editors and many format conversion pipelines.
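A minimal sketch of the L2 scheme follows. It assumes bits are assigned to eligible lines in document order, most significant bit first, and that extraction reads the first 64 lines; neither detail is stated in the text, and the function names are hypothetical.

```python
from typing import List, Optional


def embed_l2(text: str, mark_id: int, n_bits: int = 64) -> str:
    """Append a space (bit 0) or tab (bit 1) to each eligible line, one bit per line."""
    bits = format(mark_id & ((1 << n_bits) - 1), f"0{n_bits}b")
    out = []
    used = 0
    for line in text.split("\n"):
        # A line is eligible only if it has no trailing whitespace already.
        eligible = line == line.rstrip(" \t")
        if eligible and used < n_bits:
            line += "\t" if bits[used] == "1" else " "
            used += 1
        out.append(line)
    return "\n".join(out)


def extract_l2(text: str, n_bits: int = 64) -> List[Optional[int]]:
    """Read one bit per line; None marks a line whose trailing mark is gone.

    Simplification: assumes every one of the first n_bits lines was eligible
    at embed time, so line index equals bit index.
    """
    bits: List[Optional[int]] = []
    for line in text.split("\n")[:n_bits]:
        if line.endswith("\t"):
            bits.append(1)
        elif line.endswith(" "):
            bits.append(0)
        else:
            bits.append(None)
    return bits
```

Any tool that right-strips each line turns every entry into None, which is the failure mode described above.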

Layer 3 (L3) operates at the semantic level through five independent sublayers. The primary sublayer, T1, performs synonym rotation across 151 word classes, each containing 2 to 8 variants (e.g., {begin, start, commence}). The mark_id determines the variant selected at each position via a deterministic SHA-256 expansion. Four additional sublayers embed signal in punctuation choices (T2: Oxford comma, dash style, quote style), spelling variants (T2b: 25 British/American pairs), contraction expansion (T2c: 30 pairs), and number formatting (T2d: comma separators, percent symbol). The combined L3 capacity ranges from 30 to 80 bits per page of typical English prose.
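The T1 sublayer can be sketched as follows. The three synonym classes are an illustrative subset (the first is the example given above; the other two are hypothetical), and the way SHA-256 output is mapped to a variant index (hash of mark_id plus instance position, reduced modulo the class size) is an assumption about what "deterministic SHA-256 expansion" means here. Capitalization handling is omitted for brevity.

```python
import hashlib
import re

# Illustrative subset only; the text describes 151 classes of 2 to 8 variants.
SYNONYM_CLASSES = [
    ("begin", "start", "commence"),
    ("however", "nevertheless", "nonetheless"),   # hypothetical class
    ("show", "demonstrate", "illustrate"),        # hypothetical class
]
WORD_TO_CLASS = {word: cls for cls in SYNONYM_CLASSES for word in cls}


def variant_index(mark_id: int, position: int, k: int) -> int:
    """Derive the variant for the position-th class instance (assumed construction)."""
    seed = mark_id.to_bytes(8, "big") + position.to_bytes(4, "big")
    digest = hashlib.sha256(seed).digest()
    return int.from_bytes(digest[:4], "big") % k


def embed_t1(text: str, mark_id: int) -> str:
    """Replace each synonym-class word with the variant the mark_id selects."""
    position = 0

    def swap(match: re.Match) -> str:
        nonlocal position
        word = match.group(0)
        cls = WORD_TO_CLASS.get(word.lower())
        if cls is None:
            return word  # not a class member: leave untouched
        choice = cls[variant_index(mark_id, position, len(cls))]
        position += 1
        return choice  # note: original capitalization is not restored

    return re.sub(r"[A-Za-z]+", swap, text)


def score_t1(text: str, mark_id: int) -> float:
    """Fraction of class instances that match the candidate mark_id's prediction."""
    hits = total = 0
    for word in re.findall(r"[A-Za-z]+", text):
        cls = WORD_TO_CLASS.get(word.lower())
        if cls is None:
            continue
        expected = cls[variant_index(mark_id, total, len(cls))]
        hits += word.lower() == expected
        total += 1
    return hits / total if total else 0.0
```

With three-variant classes, a wrong candidate in this sketch matches roughly one instance in three by chance, while the correct candidate matches every unmodified instance.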

2.2 ECC Protection

The ecc module wraps L3 synonym bits in a repetition code with majority-vote decoding. Each payload bit is repeated R times (default R=7). Decoding recovers each payload bit by majority vote over its group, tolerating up to floor(R/2) = 3 errors per group. For a 64-bit mark_id at R=7, the coded signal spans 64 × 7 = 448 synonym-class instances. At approximately 150 instances per page, three pages of text (about 450 instances) are enough to carry every coded bit once.
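The repetition code is small enough to show in full. The sketch below is not taken from the ecc module; the function names are hypothetical and the layout (each payload bit's R copies stored contiguously) is an assumption.

```python
from typing import List


def rep_encode(bits: List[int], r: int = 7) -> List[int]:
    """Repeat every payload bit r times, copies stored contiguously."""
    return [b for b in bits for _ in range(r)]


def rep_decode(coded: List[int], r: int = 7) -> List[int]:
    """Recover each payload bit by majority vote over its group of r copies."""
    decoded = []
    for i in range(0, len(coded), r):
        group = coded[i:i + r]
        decoded.append(1 if 2 * sum(group) > len(group) else 0)
    return decoded


# Usage: a 64-bit payload becomes 448 coded bits; 3 flips per group are absorbed.
payload = [int(c) for c in format(0x0123456789ABCDEF, "064b")]
coded = rep_encode(payload)
damaged = coded[:]
for g in range(64):
    for j in range(3):
        damaged[g * 7 + j] ^= 1   # flip 3 of the 7 copies in every group
recovered = rep_decode(damaged)   # equals payload
```

A fourth flip in any one group tips that group's vote and corrupts exactly one payload bit, which is why the tolerance is stated per group.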

2.3 Content Fingerprinting

Content fingerprints are computed at seal time and stored server-side (in the registry or alongside the sealed file). They are never embedded in the document, so the adversary cannot strip them. Two fingerprint types are computed independently. Winnowing (Schleimer, Wilkerson, and Aiken [16]) produces rolling hash fingerprints over character k-grams of the normalized text. Sentence hashing computes SHA-256 hashes (truncated to 16 hex characters) over the sorted content words of each sentence, providing order-independence within sentences. The combined similarity score is weighted: 0.4 for winnowing overlap plus 0.6 for sentence hash overlap.
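A rough sketch of the two fingerprint types and the weighted score follows. Several details are assumptions not given in the text: the k-gram length (5), the window size (4), the stop-word list, the use of Jaccard similarity as the "overlap" measure for both components, and the use of a truncated SHA-256 per k-gram in place of a true rolling hash (chosen for brevity, not speed).

```python
import hashlib
import re
from typing import Set

# Hypothetical stop-word list; the real content-word filter is not specified.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to", "in", "and", "or",
                        "is", "are", "was", "were", "that", "this", "it"})


def winnow(text: str, k: int = 5, window: int = 4) -> Set[int]:
    """Keep the minimum k-gram hash from every sliding window of hashes."""
    norm = re.sub(r"[^a-z0-9]+", "", text.lower())
    hashes = [
        int.from_bytes(hashlib.sha256(norm[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(norm) - k + 1)
    ]
    if not hashes:
        return set()
    last_start = max(1, len(hashes) - window + 1)
    return {min(hashes[i:i + window]) for i in range(last_start)}


def sentence_hashes(text: str) -> Set[str]:
    """SHA-256 (first 16 hex chars) over each sentence's sorted content words."""
    out = set()
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(w for w in re.findall(r"[a-z0-9']+", sentence.lower())
                       if w not in STOP_WORDS)
        if words:
            out.add(hashlib.sha256(" ".join(words).encode()).hexdigest()[:16])
    return out


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets are treated as no evidence (0.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(original: str, leaked: str) -> float:
    """0.4 * winnowing overlap + 0.6 * sentence-hash overlap, per the text."""
    return (0.4 * jaccard(winnow(original), winnow(leaked))
            + 0.6 * jaccard(sentence_hashes(original), sentence_hashes(leaked)))
```

Because normalization drops everything except letters and digits, stripping zero-width characters or trailing whitespace leaves both fingerprints unchanged in this sketch.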

3. Attack Taxonomy

The following table enumerates the principal attack classes, the effort required, and the watermark layers each class defeats. The taxonomy is organized by increasing adversary capability and cost.

| Attack Class | Effort | Defeats L1 (Zero-Width) | Defeats L2 (Whitespace) | Defeats L3 (Semantic) | Defeats Fingerprint DB |
|---|---|---|---|---|---|
| Strip invisible characters | Trivial (one regex) | Yes | No | No | No |
| Strip trailing whitespace | Trivial (one regex) | No | Yes | No | No |
| Both normalizations | Trivial | Yes | Yes | No | No |
| Format conversion (DOCX to TXT) | Trivial | Usually yes | Partial | No | No |
| Screenshot + OCR | Medium | Yes | Yes | No | No |
| Manual paraphrase (light, 10-20%) | High (hours) | Yes | Yes | Partial | No |
| LLM paraphrase (SIRA-class) | Low (~$0.88/M tokens) | Yes | Yes | Yes | Partial |
| Screenshot + paraphrase retype | High | Yes | Yes | Yes | Partial |
| Full hostile rewrite | Variable | Yes | Yes | Yes | Yes |

The fundamental structural constraint is that any mark embedded in formatting or invisible characters is destroyed by normalization, and any mark embedded in word choice is destroyed by paraphrasing. The only signals that persist through total rewrite are structural and semantic properties of the content itself, and even those degrade proportionally to the rewriting distance. The attack taxonomy therefore partitions cleanly into two regimes: pre-paraphrase (where L3 survives intact) and post-paraphrase (where only the fingerprint database and Bayesian fusion of weak residual signals remain).

4. Per-Layer Resilience Results

4.1 L1 Survival Rates

L1 recovery is binary: either the zero-width character frames are present in the extracted text, or they are not. There is no partial recovery because the encoding uses a fixed frame structure with delimiter characters. In controlled testing, L1 survived copy-paste operations in most desktop text editors and word processors (Microsoft Word, LibreOffice Writer, Notepad++). It was destroyed by format conversion from DOCX to plain TXT in the majority of tested conversion tools, by PDF-to-text extraction in all tested tools, and by email forwarding through all tested mail clients. A single regex application ([\u200b\u200c\u200d\ufeff] applied globally) eliminates L1 in under one millisecond regardless of document size.
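The stripping step is short enough to show directly. This is the character class quoted above, compiled with Python's re module; the function name is hypothetical.

```python
import re

# Character class quoted in the text: ZWSP, ZWNJ, ZWJ, and the BOM / ZWNBSP.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")


def strip_zero_width(text: str) -> str:
    """Remove every zero-width character, which erases all L1 frames."""
    return ZERO_WIDTH.sub("", text)
```

The same function is useful on the defender's side as a normalization step before any comparison that should ignore L1.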

L1 exists not because it is robust, but because it is cheap to embed, cheap to extract, and catches the adversary who copies the text without awareness of watermarking. Its role in the defense-in-depth stack is to provide fast, high-confidence attribution for the easiest attack scenario (verbatim copy) while contributing no value against even minimally sophisticated adversaries.

4.2 L2 Survival Rates

L2 exhibits marginally better resilience than L1 because many text editors and linters do not strip trailing whitespace from content fields (as opposed to source code). In testing, L2 survived copy-paste within the same application for 7 of 10 tested editors and survived email forwarding in 0 of 5 tested clients. Format conversion (DOCX to TXT, PDF to TXT) destroyed L2 in all tested pipelines. Editor save-on-close destroyed L2 in editors configured to strip trailing whitespace (3 of 10 tested). Partial recovery is supported: when only 40 of the 64 required lines survive, the recovered bits yield a 62.5%-confidence candidate that contributes to multi-layer Bayesian fusion scoring.
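Partial L2 recovery can be sketched as a coverage figure plus a per-candidate agreement score. Reading the 62.5% confidence as the share of surviving bit positions (40 / 64 = 0.625) is an assumption, as are the function names and the most-significant-bit-first ordering.

```python
from typing import Optional, Sequence


def coverage(recovered: Sequence[Optional[int]]) -> float:
    """Share of bit positions whose trailing-whitespace mark survived."""
    return sum(b is not None for b in recovered) / len(recovered)


def agreement(candidate_id: int, recovered: Sequence[Optional[int]]) -> float:
    """Among surviving positions, fraction that match the candidate mark_id."""
    n = len(recovered)
    expected = format(candidate_id & ((1 << n) - 1), f"0{n}b")
    known = [(i, b) for i, b in enumerate(recovered) if b is not None]
    if not known:
        return 0.0
    return sum(int(expected[i]) == b for i, b in known) / len(known)


# Usage: 40 of 64 lines survive, the remaining 24 are unknown.
true_id = 0x0123456789ABCDEF
bits = [int(c) for c in format(true_id, "064b")]
recovered = bits[:40] + [None] * 24
```

Here coverage(recovered) is 0.625 and the true mark_id agrees on every surviving position, so both numbers can be handed to a fusion stage as weak evidence rather than compared against a hard threshold.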

4.3 L3 Survival Rates

L3 is the primary resilience layer, and its survival characteristics are the most important for forensic attribution. The verify_semantic() function produces a weighted score between 0.0 and 1.0 by comparing the observed synonym, spelling, contraction, punctuation, and number-formatting choices against those predicted by a candidate mark_id. The benchmark results from the v0.4.4 reference implementation are:

| Document Size | Correct mark_id Score | Wrong mark_id Score | Separation |
|---|---|---|---|
| 1 KB (~150 words) | 0.900 | 0.332 | 0.568 |
| 10 KB (~1,500 words) | 0.747 | 0.329 | 0.418 |
| 100 KB (~15,000 words) | 0.585 | 0.337 | 0.248 |
| 1 MB (~150,000 words) | 0.567 | 0.329 | 0.238 |

Several observations merit discussion. First, the correct-mark score decreases with document size (from 0.900 at 1 KB to 0.567 at 1 MB). This is expected: larger documents contain proportionally more words that fall outside the 151 synonym classes, diluting the signal-to-noise ratio of the weighted score. Second, the wrong-mark score remains stable near 0.33 across all sizes, which corresponds to the expected random match probability for three-variant synonym classes (1/3 = 0.333). Third, the separation between correct and wrong scores remains substantial even at 1 MB (0.238), indicating that attribution is feasible at all tested document sizes. The verification threshold of 0.65 correctly classifies 1 KB and 10 KB documents; for 100 KB and 1 MB documents, a lower threshold or Bayesian fusion with other layers is required.
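The threshold behavior described above can be checked directly against the reported means. The sketch below only re-applies the published numbers; it does not call verify_semantic(), and the function names are hypothetical. Run-to-run variance is not reported, so the separating interval it computes is descriptive, not a recommended operating point.

```python
THRESHOLD = 0.65

# (correct-mark score, wrong-mark score) as reported in the benchmark table.
BENCHMARKS = {
    "1 KB": (0.900, 0.332),
    "10 KB": (0.747, 0.329),
    "100 KB": (0.585, 0.337),
    "1 MB": (0.567, 0.329),
}


def evaluate(threshold: float = THRESHOLD) -> dict:
    """Per-size separation and whether the threshold classifies each mean correctly."""
    rows = {}
    for size, (correct, wrong) in BENCHMARKS.items():
        rows[size] = {
            "separation": round(correct - wrong, 3),
            "accepts_correct": correct >= threshold,
            "rejects_wrong": wrong < threshold,
        }
    return rows


def separating_interval() -> tuple:
    """Range of thresholds that would split every reported mean correctly."""
    highest_wrong = max(wrong for _, wrong in BENCHMARKS.values())
    lowest_correct = min(correct for correct, _ in BENCHMARKS.values())
    return highest_wrong, lowest_correct
```

At the default 0.65 the correct mark is accepted for 1 KB and 10 KB only, while any threshold strictly between 0.337 and 0.567 separates the means at all four sizes.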

L3 survives format conversion (DOCX to TXT, PDF to TXT), invisible-character stripping, OCR, and faithful retyping. The marks are encoded in the actual word and punctuation choices, not in formatting metadata. The critical failure mode is paraphrasing: when an adversary replaces synonym-class words with alternatives not predicted by the mark_id, the correct-mark score drops proportionally to the fraction of substituted instances.

4.4 Combined Multi-Layer Bayesian Fusion

The recommended architecture for attribution replaces the independent per-layer threshold checks with a unified Bayesian likelihood score that combines all available signals:

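P(recipient=R | text) ∝ P(R) * P(L1 | R) * P(L2 | R) * P(L3_score | R) *
                        P(T2_punct | R) * P(fingerprint_sim | R)

where P(R) is the prior over recipients, the signals are treated as conditionally independent given R, and the result is normalized across all candidate recipients.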

Each signal contributes even when weak. A document where L1 and L2 have been stripped, L3 shows a 60% match, and the punctuation fingerprint shows 2 of 3 bits matching may not trigger any single-layer threshold, but the combined evidence is substantial. Empirically, Bayesian fusion raises the effective detection rate by 10 to 15 percentage points over single-layer thresholds for documents that have undergone normalization plus light paraphrasing.
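One way to compute such a fused score is in log space, which avoids underflow when many small likelihoods are multiplied. The sketch below assumes conditionally independent signals and a uniform prior unless one is supplied. The likelihood values in the usage example are invented for illustration and are not measurements; the function name is hypothetical.

```python
import math
from typing import Dict, Optional


def fuse(likelihoods: Dict[str, Dict[str, float]],
         priors: Optional[Dict[str, float]] = None,
         floor: float = 1e-9) -> Dict[str, float]:
    """Posterior over recipients from per-signal likelihoods P(signal | recipient)."""
    recipients = list(likelihoods)
    if priors is None:
        priors = {r: 1.0 / len(recipients) for r in recipients}
    log_post = {
        r: math.log(priors[r])
        + sum(math.log(max(p, floor)) for p in likelihoods[r].values())
        for r in recipients
    }
    # Normalize with log-sum-exp for numerical stability.
    peak = max(log_post.values())
    norm = peak + math.log(sum(math.exp(v - peak) for v in log_post.values()))
    return {r: math.exp(v - norm) for r, v in log_post.items()}


# Illustrative inputs only: L1 and L2 stripped (uninformative, 0.5 for everyone),
# L3 at a 60% match, punctuation at 2 of 3 bits.
evidence = {
    "alice": {"L1": 0.5, "L2": 0.5, "L3_score": 0.60,
              "T2_punct": 0.67, "fingerprint_sim": 0.80},
    "bob": {"L1": 0.5, "L2": 0.5, "L3_score": 0.33,
            "T2_punct": 0.33, "fingerprint_sim": 0.30},
}
posterior = fuse(evidence)
```

No single signal in this example would clear a 0.65 threshold on its own, yet the product of several modest likelihood ratios concentrates most of the posterior on one recipient.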

5. ECC Error Tolerance

The repetition code with majority-vote decoding transforms L3 from a fragile threshold-based system into one with mathematically bounded error tolerance. The following table reports benchmark results for repetition factors R=3, R=5, and R=7 operating on a 64-bit (8-byte) mark_id payload, producing 192, 320, and 448 coded bits respectively.

| Metric | R=3 (192 coded bits) | R=5 (320 coded bits) | R=7 (448 coded bits) |
|---|---|---|---|
| Encode time | 22.6 µs | 23.3 µs | 23.6 µs |
| Decode time (no errors) | 49.6 µs | 49.3 µs | 50.8 µs |
| Decode time (20% errors) | 49.3 µs | 50.2 µs | 51.5 µs |
| Corrected bits at 20% error rate | 28 of 64 | 43 of 64 | 48 of 64 |
| Confidence at 20% error rate | 0.56 | 0.33 | 0.25 |
| Maximum correctable error rate | ~33% | ~40% | ~42% |

All three repetition factors complete encode and decode operations in under 100 microseconds, making ECC overhead negligible relative to the synonym embedding and verification passes. The R=7 configuration corrects 48 of 64 payload bits at a 20% random error rate, and its theoretical maximum correctable rate is approximately 42% (3 of 7 bits per group; majority vote fails once 4 of the 7 copies are flipped). The confidence score decreases with higher R because each corrected group contributes less individual certainty, but the net payload recovery is substantially better.

The practical implication is that an adversary who paraphrases 20% of synonym-class words (a moderate attack intensity corresponding to light manual editing) will usually fail to prevent mark recovery when ECC is enabled: the per-group guarantee holds whenever no more than 3 of a group's 7 copies are altered, and at a 20% error rate most groups stay within that bound. Without ECC, the same 20% error rate would reduce the L3 verification score below the 0.65 threshold for documents larger than 10 KB. The repetition code is simpler than algebraic codes (no Galois field arithmetic is required), but it achieves the primary goal. A future version may replace it with BCH(63,16,11), which encodes 16 data bits into 63 coded bits tolerating up to 11 errors (17.5% error rate) with superior bandwidth efficiency.
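The per-group bound can be turned into payload-level expectations with a binomial calculation. The sketch below assumes independent, identically distributed bit flips, an idealization that real paraphrasing does not satisfy; the function names are hypothetical.

```python
from math import comb


def group_failure_prob(p: float, r: int) -> float:
    """Probability that more than floor(r/2) of r copies are flipped."""
    t = r // 2
    return sum(comb(r, k) * p**k * (1 - p)**(r - k) for k in range(t + 1, r + 1))


def expected_wrong_bits(p: float, r: int, n_bits: int = 64) -> float:
    """Expected number of payload bits decoded incorrectly."""
    return n_bits * group_failure_prob(p, r)


def full_recovery_prob(p: float, r: int, n_bits: int = 64) -> float:
    """Probability that every payload bit decodes correctly."""
    return (1 - group_failure_prob(p, r)) ** n_bits
```

Under this model, at p = 0.20 and R = 7 a group fails with probability of about 0.033, so roughly 2 of the 64 payload bits are expected to decode incorrectly and exact recovery of all 64 bits is the exception. Attribution at that error rate therefore plausibly depends on matching the decoded value to the nearest registered mark_id, which is an inference from the arithmetic and not a described feature of the protocol.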

6. Content Fingerprinting as Last Resort

Content fingerprinting serves as the attribution mechanism of last resort when all embedded watermark layers have been stripped. Because the fingerprints are stored server-side and never embedded in the document, the adversary cannot strip them by modifying the document. The question is whether the leaked text retains enough structural similarity to the original for the fingerprint comparison to produce a match.

The critical test is the VM-strip-export scenario: the adversary opens the document in an airgapped virtual machine, strips all zero-width characters and trailing whitespace (destroying L1 and L2 completely), and exports the cleaned file. In this scenario, L3 semantic marks survive intact (because they are in the word choices, not the formatting), and the content fingerprint produces a 100% match against the stored fingerprint for that recipient's copy.

The measured fingerprint computation times from the v0.4.4 benchmarks are:

| Document Size | Fingerprint Time | Winnowing Hashes | Sentence Hashes |
|---|---|---|---|
| 1 KB | 3.37 ms | 378 | 14 |
| 10 KB | 32.0 ms | 477 | 146 |
| 100 KB | 321 ms | 477 | 1,451 |
| 1 MB | 3.35 s | 477 | 14,862 |

Fingerprint computation is the most expensive per-byte operation in the Oversight pipeline at 3.35 seconds per megabyte, dominated by the rolling hash computation in the winnowing algorithm. The winnowing hash count plateaus at 477 for documents of 10 KB and larger due to the window-based selection algorithm, while sentence hash counts scale linearly with document length. For attribution queries, fingerprint comparison is fast: Jaccard similarity over sorted hash sets runs in O(n log n) time, and sentence overlap runs in O(n) with a hash set. Comparing a leaked document against 1,000 stored fingerprints completes in under one second.
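An attribution query against stored fingerprints can be sketched as a ranked comparison. The registry layout (recipient mapped to a pair of hash sets) and the function names are hypothetical, and Jaccard similarity is assumed for both components. Python's built-in sets are hash-based, so each comparison here is linear in the set sizes.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; two empty sets are treated as no evidence (0.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(
    leaked_winnow: Set[int],
    leaked_sentences: Set[str],
    registry: Dict[str, Tuple[Set[int], Set[str]]],
    w_winnow: float = 0.4,
    w_sentence: float = 0.6,
) -> List[Tuple[str, float]]:
    """Score every stored fingerprint against the leak, best match first."""
    scored = []
    for recipient, (stored_winnow, stored_sentences) in registry.items():
        score = (w_winnow * jaccard(leaked_winnow, stored_winnow)
                 + w_sentence * jaccard(leaked_sentences, stored_sentences))
        scored.append((recipient, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Usage with toy fingerprints.
registry = {
    "alice": ({1, 2, 3, 4}, {"a1b2", "c3d4"}),
    "bob": ({9, 10}, {"ffff"}),
}
ranked = rank_candidates({1, 2, 3, 4}, {"a1b2", "c3d4"}, registry)
```

A linear scan like this is adequate for registries on the order of a thousand entries; larger registries would need an index in front of it.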

The fingerprint database's effectiveness degrades with rewriting distance. Winnowing fingerprints fail entirely under paraphrasing because character k-grams change when words change. Sentence hashes are more resilient (they sort content words within each sentence, providing order-independence) but fail when content words are replaced. For heavily rewritten text, neither fingerprint type produces a match. Modern sentence-transformer embeddings (not yet integrated into Oversight) achieve greater than 0.85 cosine similarity for paraphrases, offering a potential upgrade path for the fingerprint database.

7. Information-Theoretic Capacity Bounds

The fundamental capacity of the synonym-based watermark channel can be analyzed through the binary symmetric channel (BSC) model. Each synonym-class instance acts as a discrete channel carrying up to log2(K) bits, where K is the number of variants in the class (typically 3, yielding 1.58 bits). The analysis below simplifies this to one binary channel use per instance. An adversary who paraphrases flips each bit independently with probability p, the crossover probability. For the BSC with crossover probability p, the channel capacity C is given by:

C = 1 - H(p) bits per channel use

where H(p) = -p * log2(p) - (1-p) * log2(1-p) is the binary entropy function.

Applying this model to the Oversight synonym channel with 150 instances per page of English prose:

| Crossover Probability (p) | Scenario | Capacity (bits/instance) | Usable Bits per Page | Pages Required for 64-bit mark_id |
|---|---|---|---|---|
| 0.00 | No paraphrasing | 1.00 | ~150 | <1 |
| 0.10 | Light paraphrasing | 0.53 | ~80 | ~1 |
| 0.20 | Moderate editing | 0.28 | ~42 | ~2 |
| 0.30 | LLM paraphrase (moderate) | 0.12 | ~18 | ~4 |
| 0.40 | LLM paraphrase (aggressive) | 0.03 | ~4 | ~15 |
| 0.50 | Total randomization | 0.00 | 0 | Impossible |
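The capacity figures follow mechanically from the BSC formula and can be reproduced with a few lines. The sketch assumes one binary channel use per synonym-class instance and 150 instances per page, as in the table; the function names are hypothetical. Values are computed from the unrounded capacity, so at p = 0.40 they can differ by a bit or by a page or two from figures obtained by rounding the capacity to two decimals first.

```python
import math


def binary_entropy(p: float) -> float:
    """H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel, in bits per channel use."""
    return 1.0 - binary_entropy(p)


def bits_per_page(p: float, instances_per_page: int = 150) -> float:
    """Upper bound on recoverable bits per page at crossover probability p."""
    return instances_per_page * bsc_capacity(p)


def pages_required(p: float, payload_bits: int = 64,
                   instances_per_page: int = 150) -> float:
    """Minimum pages needed to carry the payload; infinite at zero capacity."""
    per_page = bits_per_page(p, instances_per_page)
    return math.inf if per_page == 0 else payload_bits / per_page


for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"p={p:.2f}  C={bsc_capacity(p):.3f}  "
          f"bits/page={bits_per_page(p):.1f}  pages={pages_required(p):.2f}")
```

These are Shannon bounds: they say what an ideal code could achieve, and a simple repetition code operates well below them.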

The Shannon limit is unambiguous: at p = 0.5, channel capacity drops to zero. No encoding scheme, regardless of its sophistication, can transmit information over a zero-capacity channel. This is a mathematical fact, not an engineering limitation. The practical implication is that ECC-protected synonym watermarks are viable against light-to-moderate paraphrasing (p ≤ 0.3), lose capacity rapidly beyond that, and fundamentally cannot survive aggressive rewriting in which p approaches 0.5. For the aggressive case, the fingerprint database is the only remaining attribution mechanism, and its effectiveness also degrades with rewriting distance.

An important subtlety is that real paraphrasing does not produce independent, identically distributed bit flips. Human paraphrasing tends to preserve low-salience function words (which happen to include many synonym-class members such as "however," "nevertheless," "additionally") while rewriting high-salience content words. LLM paraphrasing exhibits different patterns depending on the model and prompt. The BSC model is therefore a conservative estimate for non-adversarial paraphrasing: actual survival rates are typically higher than the BSC prediction because the edits fall disproportionately on words outside the synonym classes. It is not a bound on an adversary who deliberately targets synonym-class words.

8. The No Free Lunch Theorem

In work presented at NeurIPS 2024, Pang, Hu, et al. [3] established a fundamental trade-off in watermarking design, identifying three properties that cannot all be maximized simultaneously: robustness (the watermark survives text modifications), anti-spoofing (adversaries cannot forge the watermark onto arbitrary text to frame an innocent party), and easy detection (detection does not require storing all generated text server-side).

The key finding is that robust watermarks are inherently vulnerable to spoofing attacks. Publicly detectable watermarks (using cryptographic signatures) resist spoofing but sacrifice robustness. No design can achieve all three properties at their theoretical maximum. Practitioners must choose which threat to prioritize based on their operational context.

Oversight's threat model prioritizes robustness (leak attribution) over anti-spoofing, because the primary use case is identifying which recipient leaked a document, not proving document authenticity to third parties. The spoofing risk (an attacker fabricating a watermarked document to frame another recipient) is mitigated at the protocol level rather than the watermark level. The signed manifest binds each mark_id to a specific recipient, issuer, and timestamp. The transparency log records the binding before the document is delivered. An attacker who forges a watermark into a document cannot produce a valid manifest signature from the original issuer, so the forged mark would not resolve in the registry. This compensation shifts the No Free Lunch balance toward robustness without leaving spoofing completely unaddressed, though it does require the registry to be consulted during attribution.

9. LLM-Based Stripping: The SIRA Attack and Current Defenses

LLM-based paraphrasing is the most dangerous attack vector in the current threat landscape because it is cheap (approximately $0.88 per million tokens at 2026 pricing) and effective against all known text watermarking methods. The Self-Information Rewrite Attack (SIRA), introduced by Cheng et al. [4] and published at ICML 2025, achieves nearly 100% attack success against seven recent watermarking schemes. SIRA works by identifying high-entropy tokens (which are most likely to carry watermark information) using self-information calculations, masking them, and using an LLM to fill in replacements. The fundamental insight is that watermarks necessarily concentrate signal in high-entropy positions because low-entropy tokens are too predictable to carry information. SIRA exploits this structural weakness without requiring access to the watermark algorithm or the watermarked LLM.

Methods tested and defeated by SIRA include KGW [1], the Unigram watermark, the EXP watermark (Kuditipudi et al.), and SIR [8]. Google's SynthID-Text also shows sharp detection accuracy drops under light paraphrasing, copy-paste modifications, and back-translation, as assessed by the SynGuard robustness study [10].

Several defense mechanisms have been proposed specifically for LLM paraphrasing resistance. SemaMark [7] (NAACL 2024) uses sentence-level semantic embeddings instead of token-level hashes for watermark generation, so paraphrases that preserve meaning land in the same semantic region. SEMSTAMP [11] uses locality-sensitive hashing of sentence embeddings with rejection sampling. SimMark [12] (EMNLP 2025) operates at the sentence level rather than the token level. DualGuard [9] is presented as the first algorithm designed to defend against both paraphrase and spoofing attacks simultaneously, using an adaptive dual-stream mechanism. All of these are designed for watermarking LLM-generated text at generation time, not for post-hoc watermarking of existing documents.

Oversight's problem is structurally different: watermarking existing documents after they are written, without control over the generation process. The generation-time approaches (green-list/red-list partitioning, rejection sampling of sentence candidates) are not directly applicable. Oversight's defense against LLM stripping relies on three properties of its architecture. First, L3's synonym rotation distributes signal across many low-salience word choices, making the per-token information density low enough that SIRA-style targeted attacks must modify a large fraction of the text to succeed. Second, the ECC layer tolerates the resulting moderate bit errors. Third, the content fingerprint database provides attribution even when the watermark is fully destroyed, provided the leaked text retains sufficient structural similarity.

The honest assessment: against aggressive LLM rewriting where the attacker rewrites every sentence, no embedded watermark survives. The Shannon capacity at p = 0.5 is zero, and no engineering can overcome this. The fingerprint database's effectiveness also degrades because sentence hashes change under full rewriting. At this extreme, the remaining defense is canary content (unique facts per recipient), which requires author cooperation and is not automatable. The protocol does not claim to solve this problem. It claims to make unattributed leaking more expensive and to provide probabilistic attribution when the adversary's rewriting is imperfect.

10. Limitations and Future Work

Several limitations of the current evaluation and implementation should be noted. The benchmarks were conducted on a single hardware platform (Intel Core i7, Windows 10, CPython 3.14.2); performance characteristics may differ on other platforms, particularly under the Rust port. The L3 verification scores were measured using synthetic test documents generated by the benchmark script, not natural-language corpora; the synonym-class hit rate in real documents varies with vocabulary composition and domain. The ECC evaluation used only repetition codes; algebraic codes (BCH, Reed-Solomon) offer better bandwidth efficiency and are planned for a future release.

The most significant gap in the current system is the absence of semantic embedding fingerprints. The existing fingerprint database uses winnowing (character k-grams) and sentence hashing (sorted content words), both of which fail under paraphrasing. Integrating sentence-transformer embeddings (e.g., all-MiniLM-L6-v2, 384 dimensions) would provide a paraphrase-resistant fingerprint type, since modern sentence transformers achieve greater than 0.85 cosine similarity for meaning-preserving paraphrases. This integration is planned for v0.5.0.

Additional planned improvements include: syntactic tree watermarks (modernized Atallah approach [2] using spaCy dependency parsing, estimated to add 10 to 20 bits per page that survive format conversion), Bayesian multi-signal fusion to replace independent per-layer thresholds, and a locality-sensitive hashing index for O(1) leak identification across large registries. The long-term research goal is a submission to USENIX Security 2026, Cycle 2 (deadline June 2026), covering the protocol specification, threat model, watermark resilience measurements, and performance benchmarks across both implementations.

11. References

[1] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, T. Goldstein. "A Watermark for Large Language Models." ICML 2023.

[2] M.J. Atallah, V. Raskin, M. Crogan, C. Hempelmann, F. Kerschbaum, D. Mohamed, S. Naik. "Natural Language Watermarking: Design, Analysis, and a Proof-of-Concept Implementation." IH 2001.

[3] Q. Pang, S. Hu, et al. "No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices." NeurIPS 2024.

[4] Y. Cheng, et al. "Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks (SIRA)." ICML 2025.

[5] J. Kirchenbauer, J. Geiping, Y. Wen, T. Goldstein. "On the Reliability of Watermarks for Large Language Models." 2023.

[6] K. Krishna, et al. "Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense." NeurIPS 2023.

[7] J. Ren, H. Xu, et al. "A Robust Semantics-based Watermark for Large Language Model against Paraphrasing (SemaMark)." NAACL 2024 Findings.

[8] A. Liu, et al. "A Semantic Invariant Robust Watermark for Large Language Models." ICLR 2024.

[9] "DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack." Dec 2025.

[10] "SynGuard: Robustness Assessment and Enhancement of Text Watermarking for Google's SynthID." 2025.

[11] A. Hou, et al. "SEMSTAMP: A Semantic Watermark with Paraphrastic Robustness." 2024.

[12] "SimMark: A Robust Sentence-Level Similarity-Based Watermark." EMNLP 2025.

[13] "We Can Hide More Bits: The Unused Watermarking Capacity in Theory and in Practice." 2024.

[14] W. Qu, et al. "Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code." USENIX Security 2025.

[15] "The Coding Limits of Robust Watermarking for Generative Models." 2024.

[16] S. Schleimer, D.S. Wilkerson, A. Aiken. "Winnowing: Local Algorithms for Document Fingerprinting." SIGMOD 2003.

[17] "A Survey of Text Watermarking in the Era of Large Language Models." ACM Computing Surveys 2024.

[18] "Stylometric Watermarks vs. LLM Watermarks: Can We Really Trace AI Authorship?" Data Science Collective, 2025.

[19] "Stylometry recognizes human and LLM-generated texts in short samples." 2025.


This document describes watermark resilience as measured in Oversight v0.4.4. The anti-stripping research source document with full methodology is available in the repository (ANTI_STRIPPING_RESEARCH.md).