Performance Evaluation: Measured Throughput, Overhead, and Scaling
Systematic benchmarking of the Oversight seal/open pipeline, watermark embedding, extraction, fingerprinting, and ECC operations
Zion Boggan · April 2026 · Oversight Protocol v0.4.4
Abstract
This paper presents a systematic performance evaluation of the Oversight protocol v0.4.4, covering the complete seal/open pipeline, per-layer watermark embedding, watermark extraction, content fingerprint computation, file size overhead, and error-correcting code operations. All measurements were obtained on an Intel Core i7 (Family 6 Model 158) running Windows 10 under CPython 3.14.2, with each benchmark executed 10 times and reported as mean with standard deviation. Peak seal throughput reaches 253 MB/s at the 1 MB document level, dominated by XChaCha20-Poly1305 AEAD encryption. Watermark embedding adds 484% overhead to the seal operation at 1 MB, and in the per-layer measurements L3 semantic processing (regex-based synonym matching across 151 classes) accounts for 85% of embedding time at that size. Content fingerprinting via winnowing operates at 3.35 seconds per megabyte. ECC encode and decode complete in under 100 microseconds for the standard 64-bit payload. The evaluation identifies L3 regex replacement, L1 zero-width density, and winnowing k-gram computation as the three primary optimization targets.
1. Methodology
All benchmarks were executed on the following hardware and software configuration: Intel64 Family 6 Model 158 Stepping 9 (Intel Core i7), Windows 10 (AMD64), CPython 3.14.2. The benchmark script (bench_usenix.py) generates synthetic plaintext documents at four sizes (1 KB, 10 KB, 100 KB, and 1 MB) and executes each operation N=10 times, recording wall-clock time via time.perf_counter(). Results are reported as arithmetic mean and sample standard deviation. Minimum and maximum values are recorded for outlier identification. The benchmark script and raw output are available in the repository under PERFORMANCE_BENCHMARKS.md.
Two classes of measurement are reported. "Pipeline measurements" time the complete seal or open operation as experienced by a caller invoking the top-level API. "Component measurements" isolate individual stages (L1 embedding, L2 embedding, L3 embedding, content fingerprinting, ECC encode/decode, L3 verification) by timing their entry and exit points independently. Component measurements exclude Python import overhead and fixture setup.
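The measurement procedure described above can be sketched as a small harness. This is a minimal illustration, not the contents of bench_usenix.py: the names bench and synthetic_document are hypothetical, and the timed workload is a stand-in for the real seal/open calls. Fixture construction happens outside the timed region, matching the statement that component measurements exclude setup.

```python
import statistics
import time


def bench(operation, runs=10):
    """Time `operation` `runs` times and summarize the wall-clock samples."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation (n - 1)
        "min": min(samples),
        "max": max(samples),
    }


def synthetic_document(size_chars):
    """Hypothetical generator: repeat an ASCII filler sentence up to the target size."""
    filler = "The quick brown fox jumps over the lazy dog. "
    repeats = size_chars // len(filler) + 1
    return (filler * repeats)[:size_chars]


if __name__ == "__main__":
    for size in (1024, 10 * 1024, 100 * 1024, 1024 * 1024):
        doc = synthetic_document(size)  # fixture built outside the timed region
        # Stand-in workload; the real script would call seal/open here.
        print(size, bench(lambda: doc.encode("utf-8")))
```

Recording the minimum and maximum alongside the mean makes a single slow run easy to spot, which matters most for the sub-millisecond operations.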
The cryptographic backend is the cryptography library (wrapping OpenSSL) for X25519, Ed25519, HKDF-SHA256, and SHA-256, and PyNaCl (wrapping libsodium) for XChaCha20-Poly1305 AEAD. The watermarking, fingerprinting, and ECC modules are pure Python. The Rust port was not benchmarked for this evaluation; cross-language performance comparison is planned for a subsequent report.
2. Seal and Open Throughput
The seal operation constructs the manifest, performs X25519 key agreement, derives the DEK via HKDF-SHA256, encrypts the (watermarked) plaintext with XChaCha20-Poly1305, computes the SHA-256 content hash, signs the manifest with Ed25519, and serializes the binary container. The open operation parses the container, verifies the Ed25519 signature, performs X25519 key agreement, unwraps the DEK, decrypts the ciphertext, and verifies the SHA-256 content hash.
| Size | Seal Mean | Seal Stddev | Seal Throughput | Open Mean | Open Stddev | Open Throughput |
|---|---|---|---|---|---|---|
| 1 KB | 341 us | 136 us | 2.9 MB/s | 1.16 ms | 2.79 ms | 0.8 MB/s |
| 10 KB | 323 us | 9.7 us | 30.2 MB/s | 301 us | 4.6 us | 32.4 MB/s |
| 100 KB | 600 us | 25.3 us | 163 MB/s | 576 us | 23.9 us | 170 MB/s |
| 1 MB | 3.95 ms | 255 us | 253 MB/s | 3.93 ms | 78.3 us | 254 MB/s |
Several patterns emerge from these measurements. First, the 1 KB data points exhibit high variance (136 us stddev for seal, 2.79 ms stddev for open). For open, the standard deviation exceeds the mean, which with N=10 indicates that one or two slow outlier runs dominate the statistic; this also explains why the 1 KB open mean (1.16 ms) is higher than the 10 KB mean (301 us). The likely causes are first-call warm-up effects (lazy initialization in the cryptographic C extensions, cold caches) and operating-system scheduling noise, which weigh most heavily on sub-millisecond runs. Second, at 10 KB and above, the variance stabilizes and throughput increases monotonically, reaching 253 MB/s (seal) and 254 MB/s (open) at 1 MB. This convergence toward a common throughput reflects the fact that both operations are dominated at large sizes by the same primitive: XChaCha20-Poly1305, which processes data at a rate determined by libsodium's stream cipher implementation.
The constant-overhead component (X25519 key agreement, Ed25519 signing/verification, HKDF derivation, manifest serialization) accounts for approximately 200 to 300 microseconds regardless of document size. At 1 KB, this constant overhead dominates. At 1 MB, it is small (under 10% of the total) relative to the ~3.7 ms spent in AEAD encryption/decryption. The crossover point, where linear AEAD cost exceeds constant crypto overhead, occurs near 100 KB.
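The fixed-plus-linear cost model described above can be checked against the seal column of the table with an ordinary least-squares fit. The sketch below is illustrative only: the function name fit_line is hypothetical, and a fit through four points is indicative rather than precise.

```python
# Seal means from the table above: (document size in bytes, seconds).
SEAL = [
    (1024, 341e-6),
    (10 * 1024, 323e-6),
    (100 * 1024, 600e-6),
    (1024 * 1024, 3.95e-3),
]


def fit_line(points):
    """Ordinary least-squares fit of t = fixed + per_byte * n."""
    count = len(points)
    mean_n = sum(n for n, _ in points) / count
    mean_t = sum(t for _, t in points) / count
    sxx = sum((n - mean_n) ** 2 for n, _ in points)
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in points)
    per_byte = sxy / sxx
    fixed = mean_t - per_byte * mean_n
    return fixed, per_byte


fixed, per_byte = fit_line(SEAL)
crossover = fixed / per_byte  # size at which the linear term equals the fixed term

print(f"fixed cost      ~ {fixed * 1e6:.0f} us")
print(f"streaming rate  ~ {1 / per_byte / 1e6:.0f} MB/s")
print(f"crossover size  ~ {crossover / 1024:.0f} KB")
```

On these four points the fit places the fixed term at roughly 290 microseconds and the crossover in the 80 to 90 KB range, in line with the estimates in the text.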
3. Watermark Embedding Overhead
Watermark embedding occurs before the cryptographic seal. The following table reports the seal time with and without watermarking, isolating the overhead attributable to watermark insertion.
| Size | Seal (no watermark) | Seal (with watermark) | Overhead |
|---|---|---|---|
| 1 KB | 297 us | 305 us | +2.6% |
| 10 KB | 325 us | 471 us | +44.8% |
| 100 KB | 627 us | 2.76 ms | +340% |
| 1 MB | 4.07 ms | 23.78 ms | +484% |
The percentage overhead grows with document size because watermark embedding scales as O(n) in text length with a higher per-byte cost than AEAD encryption, while the fixed cryptographic cost that dominates the baseline at small sizes is amortized away as documents grow. At 1 KB the watermark adds only 8 microseconds (within measurement noise); at 1 MB it adds 19.7 milliseconds. Despite the large percentage increase, the total seal time at 1 MB (23.78 ms) remains well below typical interactive latency thresholds.
3.1 Per-Layer Breakdown
Isolating each watermark layer reveals that L3 semantic processing dominates the embedding cost at all document sizes.
| Size | L1 (Zero-Width) | L2 (Whitespace) | L3 (Semantic) | All Layers | L3 Share |
|---|---|---|---|---|---|
| 1 KB | 230 us | 20.5 us | 1.39 ms | 1.61 ms | 86% |
| 10 KB | 1.88 ms | 66 us | 12.5 ms | 13.9 ms | 90% |
| 100 KB | 19.5 ms | 401 us | 122 ms | 141 ms | 87% |
| 1 MB | 213 ms | 3.72 ms | 1.21 s | 1.42 s | 85% |
L3 accounts for 85 to 90% of total embedding time across all sizes. Its cost is driven by five sequential regex-based passes over the text (synonym rotation, punctuation, spelling, contractions, number formatting), each of which performs word-boundary scanning and dictionary lookup. L1 is the second-largest contributor, scaling linearly because it inserts a 66-character zero-width frame every 40 visible characters, requiring O(n) string concatenation. L2 is the cheapest layer by an order of magnitude or more because it modifies at most 64 lines regardless of document length, so the number of edits is effectively constant for documents longer than approximately 200 lines. Its measured time still grows with size (from 20.5 us at 1 KB to 3.72 ms at 1 MB), presumably because the text must still be split into lines and rejoined, but at a far smaller per-byte cost than L1 or L3.
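The L1 insertion pattern described above (a 66-character frame after every 40 visible characters) can be sketched as follows. The frame layout is an assumption made for illustration: one delimiter, 64 payload bits, one delimiter, with U+200B as 0, U+200C as 1, and U+200D as the delimiter. The function names are hypothetical, visible characters are approximated as code points, and the input is assumed to contain no zero-width characters of its own.

```python
ZERO, ONE, MARK = "\u200b", "\u200c", "\u200d"  # assumed symbol assignment


def build_frame(mark_id: int) -> str:
    """Encode a 64-bit mark_id as delimiter + 64 bit symbols + delimiter."""
    bits = format(mark_id & (2**64 - 1), "064b")
    return MARK + "".join(ONE if b == "1" else ZERO for b in bits) + MARK


def embed_l1(text: str, mark_id: int, interval: int = 40) -> str:
    """Insert one frame after every `interval` visible characters."""
    frame = build_frame(mark_id)
    pieces = []
    for start in range(0, len(text), interval):
        chunk = text[start:start + interval]
        pieces.append(chunk)
        if len(chunk) == interval:  # only after a full run of visible characters
            pieces.append(frame)
    # One join at the end keeps the whole pass linear in the text length.
    return "".join(pieces)


marked = embed_l1("a" * 1024, mark_id=0x0123456789ABCDEF)
print(len(build_frame(0)), len(marked.encode("utf-8")))
```

Collecting the pieces in a list and joining once avoids the quadratic behaviour that repeated string concatenation can exhibit in Python.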
4. Watermark Extraction Performance
Extraction is the inverse of embedding: given a text that may contain watermarks, recover candidate mark_ids. Two API entry points exist: recover_marks() (v1, extracts L1 and L2 candidates) and recover_marks_v2() (adds L3 candidate verification against a provided list of candidate mark_ids).
| Size | recover_marks() | recover_marks_v2() (no L3 candidates) | recover_marks_v2() (with L3 candidate) |
|---|---|---|---|
| 1 KB | 1.24 ms | 1.25 ms | 2.25 ms |
| 10 KB | 12.8 ms | 12.8 ms | 21.9 ms |
| 100 KB | 128 ms | 126 ms | 217 ms |
| 1 MB | 1.31 s | 1.32 s | 2.25 s |
Without L3 candidate verification, the v1 and v2 APIs perform identically (within measurement noise), as both perform the same L1 and L2 extraction passes. Adding a single L3 candidate for verification increases the extraction time by roughly 70 to 80% because verify_semantic() must scan the full text against the 151-class synonym dictionary and score the observed choices against the candidate mark_id. The L3 verification cost scales linearly with text length and linearly with the number of candidates tested. For attribution workflows where the registry contains thousands of mark_ids, a pre-filtering step (using L1 or L2 recovery to narrow the candidate set) is essential to avoid O(n * k) scaling where k is the candidate count.
The L3 verification times from the benchmark confirm this linear relationship:
| Size | L3 Verify (correct mark_id) | L3 Verify (wrong mark_id) |
|---|---|---|
| 1 KB | 961 us | 958 us |
| 10 KB | 9.10 ms | 9.22 ms |
| 100 KB | 90.5 ms | 90.3 ms |
| 1 MB | 986 ms | 945 ms |
Verification time is independent of whether the candidate mark_id is correct or incorrect, as the algorithm must scan the full text and score all synonym-class instances in both cases. The time per verification is approximately 1 ms/KB, or equivalently ~1 second per megabyte of text. For a 10 KB document with 100 candidates to test, the total attribution time would be approximately 0.9 seconds, which is acceptable for forensic use.
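The pre-filtering strategy and the linear cost estimate above can be sketched as follows. The real signatures of recover_marks() and verify_semantic() are not given in this report, so the sketch takes them as injected callables; the names attribute, recover_fast, verify_slow, and threshold, and the assumption that verification returns a score between 0 and 1, are all hypothetical.

```python
from typing import Callable, Iterable, List, Set


def attribute(
    text: str,
    registry_ids: Iterable[str],
    recover_fast: Callable[[str], Set[str]],
    verify_slow: Callable[[str, str], float],
    threshold: float = 0.9,
) -> List[str]:
    """Return registry mark_ids supported by the text, cheapest checks first."""
    registry = set(registry_ids)
    # Cheap pass: L1/L2 recovery is a single scan whose cost does not depend on k.
    shortlist = recover_fast(text) & registry
    # Fall back to the full registry only if no formatting marks were recovered.
    candidates = shortlist if shortlist else registry
    # Expensive pass: one full-text L3 scan per remaining candidate.
    return sorted(c for c in candidates if verify_slow(text, c) >= threshold)


def estimated_l3_seconds(size_kb: float, candidates: int, ms_per_kb: float = 0.9) -> float:
    """Linear cost model read off the verification table (about 0.9 to 1 ms/KB)."""
    return size_kb * candidates * ms_per_kb / 1000.0


print(estimated_l3_seconds(10, 100))
```

When the shortlist contains a single mark_id, the L3 stage costs one scan instead of k scans, which is the saving the text describes.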
5. Content Fingerprint Cost
Content fingerprinting runs once during seal (to store the fingerprint in the registry) and once during attribution (to compare the leaked document against stored fingerprints). The ContentFingerprint.from_text() function computes winnowing hashes over character k-grams and SHA-256 sentence hashes over sorted content words.
| Size | Fingerprint Time | Stddev | Winnowing Hashes | Sentence Hashes | Rate |
|---|---|---|---|---|---|
| 1 KB | 3.37 ms | 76 us | 378 | 14 | 290 KB/s |
| 10 KB | 32.0 ms | 105 us | 477 | 146 | 306 KB/s |
| 100 KB | 321 ms | 1.68 ms | 477 | 1,451 | 304 KB/s |
| 1 MB | 3.35 s | 27.6 ms | 477 | 14,862 | 306 KB/s |
Content fingerprinting is the most expensive per-byte operation in the Oversight pipeline, processing at approximately 300 KB/s (3.35 s/MB). The cost is dominated by the k-gram hashing step of the winnowing algorithm, which computes one MD5 hash per k-gram position. The winnowing hash count plateaus at 477 for documents of 10 KB and above. A constant selection density would make the count grow linearly with length, so the plateau more likely reflects saturation of the set of distinct selected hashes on the synthetic corpus, or a cap on the number of hashes retained. Sentence hash counts scale linearly with document length, as expected, reaching 14,862 for a 1 MB document (approximately one hash per 70 bytes, corresponding to an average sentence length of roughly 12 words at about six bytes per word including the space).
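The two fingerprint components described above can be sketched as follows. This is not the ContentFingerprint.from_text() implementation: the k-gram length, window size, sentence splitter, and the definition of a content word (here, any word of four or more letters) are assumptions chosen for illustration.

```python
import hashlib
import re


def kgram_hashes(text: str, k: int = 5):
    """One MD5-derived 64-bit integer per character k-gram position."""
    return [
        int.from_bytes(hashlib.md5(text[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(text) - k + 1)
    ]


def winnow(hashes, window: int = 4):
    """Keep the minimum hash of every sliding window; return the distinct values."""
    selected = set()
    for i in range(len(hashes) - window + 1):
        selected.add(min(hashes[i:i + window]))
    return selected


def sentence_hashes(text: str, min_word_len: int = 4):
    """SHA-256 over the sorted content words of each sentence."""
    digests = []
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(
            w for w in re.findall(r"[a-z]+", sentence.lower()) if len(w) >= min_word_len
        )
        if words:
            digests.append(hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest())
    return digests
```

In this sketch the winnowing result is a set, so on a repetitive input it stops growing once every distinct window has been seen, while the sentence-hash list keeps growing with the number of sentences.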
The fingerprint computation runs once at seal time and need not affect interactive latency for the seal operation itself, because it can be computed asynchronously after the sealed file is written. At attribution time, the same computation runs on the leaked document. For a 100 KB leaked document, the 321 ms fingerprint computation time is a minor delay. For 1 MB documents, the 3.35-second cost is noticeable but acceptable for a forensic workflow that is not time-critical.
6. File Size Overhead
Watermark embedding and sealed container packaging both increase the stored file size. The following table reports the measured byte counts at each stage.
| Nominal | Plaintext | Sealed (no WM) | Container Overhead | Watermarked Text | WM Added (x plaintext) | WM + Sealed |
|---|---|---|---|---|---|---|
| 1 KB | 1,024 B | 2,148 B | +1,124 B | 6,227 B | 5.08x | 7,330 B |
| 10 KB | 10,240 B | 11,365 B | +1,125 B | 63,576 B | 5.21x | 64,680 B |
| 100 KB | 102,400 B | 103,526 B | +1,126 B | 635,500 B | 5.21x | 636,605 B |
| 1 MB | 1,048,576 B | 1,049,703 B | +1,127 B | 6,506,689 B | 5.20x | 6,507,795 B |
The sealed container format adds a fixed overhead of approximately 1,125 bytes across all document sizes. This overhead comprises the 6-byte magic sequence (OVSGHT), the 2-byte header (version and suite ID), the manifest JSON (approximately 500 bytes for a typical seal), the wrapped DEK JSON (approximately 150 bytes), the 24-byte AEAD nonce, and the 16-byte Poly1305 authentication tag. At 1 MB, this fixed overhead is 0.1% of the plaintext size.
The watermark text expansion is the dominant size contributor: the watermark adds approximately 5.1 to 5.2 times the plaintext size at every tested size, so the watermarked text is about 6.2x the plaintext (for example, 6,506,689 B / 1,048,576 B ≈ 6.21 at 1 MB). L1 is responsible for nearly all of this expansion. Each zero-width character (U+200B, U+200C, or U+200D) occupies 3 bytes in UTF-8 encoding, and L1 inserts a 66-character frame (198 bytes in UTF-8) every 40 visible characters. For single-byte visible characters, this yields a theoretical size ratio of (40 + 198) / 40 = 5.95x for L1 alone (4.95x added), close to but slightly below the measured values. L2 adds at most 64 bytes (one per modified line), and L3 produces near-zero net size change because synonym substitutions replace words with alternatives of similar length.
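The frame arithmetic above can be reproduced in a few lines. The sketch assumes single-byte (ASCII) visible characters and exactly one full frame per interval; the function name l1_size_ratio is hypothetical.

```python
FRAME_CHARS = 66  # zero-width characters per L1 frame, as stated above


def l1_size_ratio(interval: int, visible_bytes_per_char: int = 1) -> float:
    """Watermarked size / plaintext size for one frame per `interval` visible characters."""
    bytes_per_zero_width = len("\u200b".encode("utf-8"))  # 3 bytes in UTF-8
    frame_bytes = FRAME_CHARS * bytes_per_zero_width       # 198 bytes
    visible_bytes = interval * visible_bytes_per_char
    return (visible_bytes + frame_bytes) / visible_bytes


print(round(l1_size_ratio(40), 2))  # 5.95
```

The interval parameter makes it straightforward to evaluate the same model at other insertion densities.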
If the 5.2x expansion is unacceptable for a given deployment, L1 density can be reduced (e.g., one frame per 100 visible characters instead of 40), trading redundancy for size. A future optimization is under consideration: encoding the L1 frame in a binary-packed format using only two zero-width characters (reducing from 3 to 2 characters per bit), which would decrease L1's contribution by approximately 33%.
7. ECC Performance
The error-correcting code module uses a repetition code with majority-vote decoding to protect L3 synonym bits against partial bit errors from paraphrasing. ECC operations are performed on the mark_id payload before embedding (encode) and after extraction (decode). The following table reports timing for R=7 (the default configuration), which produces 448 coded bits from a 64-bit (8-byte) payload.
| Payload Size | Coded Bits | Encode Mean | Encode Stddev | Decode Mean | Decode Stddev | Decode w/ 20% Errors |
|---|---|---|---|---|---|---|
| 8 bytes (64-bit mark_id) | 448 | 23.6 us | 1.0 us | 50.8 us | 0.7 us | 51.5 us |
| 16 bytes (128-bit mark_id) | 896 | 44.6 us | 1.1 us | 99.8 us | 0.5 us | 104 us |
| 32 bytes (256-bit mark_id) | 1,792 | 85.2 us | 0.8 us | 202 us | 10.9 us | 206 us |
All ECC operations complete in under 100 microseconds for the standard 64-bit payload, and in under 250 microseconds even for an extended 256-bit payload. The decode operation takes roughly twice as long as encode because it must compute majority votes across each group of R bits. The presence of 20% random bit errors does not measurably increase decode time, which suggests that the majority-vote computation is dominated by iteration overhead rather than by error-correction logic.
ECC overhead is negligible relative to all other pipeline stages. At 51 microseconds for decode (the more expensive direction), it represents less than 0.004% of the total extraction time for a 1 MB document (1.31 seconds). The decision to use a repetition code rather than an algebraic code (BCH, Reed-Solomon) was driven by implementation simplicity rather than performance: both approaches would be sub-millisecond at these payload sizes.
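A repetition code with majority-vote decoding, as described in this section, can be sketched as follows. The bit ordering (most significant bit first, with the R copies of each bit stored contiguously) is an assumption; the actual module may interleave copies differently.

```python
def ecc_encode(payload: bytes, r: int = 7) -> list:
    """Repeat every payload bit r times (MSB first within each byte)."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    return [bit for bit in bits for _ in range(r)]


def ecc_decode(coded: list, r: int = 7) -> bytes:
    """Majority-vote each group of r bits, then repack the result into bytes."""
    if len(coded) % (8 * r) != 0:
        raise ValueError("coded length must be a multiple of 8 * r")
    bits = [1 if sum(coded[i:i + r]) * 2 > r else 0 for i in range(0, len(coded), r)]
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


mark_id = bytes.fromhex("0123456789abcdef")
coded = ecc_encode(mark_id)
# Flip three of the seven copies of every bit: still below the majority threshold.
damaged = [b ^ 1 if i % 7 < 3 else b for i, b in enumerate(coded)]
print(len(coded), ecc_decode(damaged) == mark_id)  # 448 True
```

With R=7 each payload bit tolerates up to three flipped copies. The decoder does the same work whether or not errors are present, which is consistent with the flat decode timings in the table.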
8. Optimization Targets
The benchmark data identifies three primary optimization targets for the Python reference implementation.
L3 regex replacement. L3 accounts for 85% of watermark embedding time and is bottlenecked by five sequential regex passes over the full text. The synonym rotation pass alone (T1) compiles 151 regex patterns and applies each one. Precompiling all patterns into a single alternation regex, or replacing the regex engine with a direct string scan using a prefix trie, would reduce L3's per-pass cost. The Rust port already uses compiled regex via the regex crate with Aho-Corasick multi-pattern matching, which is expected to provide a 10 to 20x speedup.
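The single-alternation idea for the synonym pass can be sketched as follows. The three synonym classes are placeholders standing in for the 151-class dictionary, and the rule that the n-th matched word takes the variant selected by the n-th payload bit is an assumption for illustration, not the protocol's actual bit assignment. The sketch also ignores capitalization of the replaced word.

```python
import re

# Hypothetical synonym classes; the real dictionary has 151 of them.
SYNONYM_CLASSES = [
    ("begin", "start"),
    ("big", "large"),
    ("quick", "fast"),
]

# Map every variant to its class so one dictionary lookup replaces a per-class regex.
VARIANT_TO_CLASS = {word: cls for cls in SYNONYM_CLASSES for word in cls}

# One precompiled alternation covering every variant of every class.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, VARIANT_TO_CLASS), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def rotate_synonyms(text: str, bits):
    """Single pass: the n-th matched word takes the variant selected by the n-th bit."""
    bit_iter = iter(bits)

    def choose(match):
        bit = next(bit_iter, None)
        if bit is None:  # payload exhausted: leave the word untouched
            return match.group(0)
        return VARIANT_TO_CLASS[match.group(0).lower()][bit]

    return PATTERN.sub(choose, text)


print(rotate_synonyms("A quick start on a big task.", [1, 0, 1]))
# A fast begin on a large task.
```

The text is scanned once regardless of the number of classes, instead of once per class.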
L1 density reduction. L1 is responsible for nearly all of the 5.2x size increase (a watermarked text about 6.2x the plaintext) through insertion of zero-width character frames every 40 visible characters. Under the 198-byte frame model from Section 6, reducing the insertion density to one frame every 100 or 200 characters would bring the watermarked size down to approximately (100 + 198) / 100 ≈ 3.0x or (200 + 198) / 200 ≈ 2.0x the plaintext respectively, at the cost of fewer redundant frames for recovery. For most documents, the lost redundancy is acceptable because L1 recovery is binary (either the frames are present or they are not) and one complete frame is sufficient for full mark_id extraction.
Winnowing k-gram computation. Content fingerprinting at 3.35 s/MB is the slowest per-byte operation in the pipeline. The bottleneck is the k-gram hashing step, which currently calls Python's hashlib.md5() once per k-gram position and therefore rehashes all k characters at every position. Switching to a polynomial rolling hash (Rabin fingerprint) computed in a single pass would cut the per-position cost from O(k) to O(1) and avoid the per-call overhead of hashlib, although the overall cost remains linear in document length. Alternatively, the fingerprinting computation can be deferred to an asynchronous post-seal task, removing it from the critical path entirely.
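A polynomial rolling hash of the kind proposed above can be sketched as follows. This is a Karp-Rabin style hash over integers modulo a prime, used here as a stand-in for the Rabin fingerprint named in the text; the base, modulus, and k-gram length are illustrative choices. Its values differ from the MD5-derived hashes, so fingerprints computed with it would not be comparable with existing ones.

```python
BASE = 257
MOD = (1 << 61) - 1  # a Mersenne prime; both constants are illustrative choices


def direct_hash(gram: str) -> int:
    """Polynomial hash of one k-gram, computed from scratch in O(k)."""
    value = 0
    for ch in gram:
        value = (value * BASE + ord(ch)) % MOD
    return value


def rolling_hashes(text: str, k: int = 5):
    """Hash of every k-gram in one pass, O(1) work per position after the first."""
    if len(text) < k:
        return []
    top = pow(BASE, k - 1, MOD)  # weight of the character leaving the window
    value = direct_hash(text[:k])
    hashes = [value]
    for i in range(k, len(text)):
        value = (value - ord(text[i - k]) * top) % MOD  # drop the oldest character
        value = (value * BASE + ord(text[i])) % MOD     # shift and append the newest
        hashes.append(value)
    return hashes
```

Each update removes the contribution of the outgoing character and appends the incoming one, so the result at every position matches a from-scratch computation over that k-gram.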
9. Comparison with Related Systems
No direct, apples-to-apples benchmark comparison is possible between Oversight and existing systems such as C2PA and Digimarc: they target different media types (primarily images and video), Digimarc is proprietary, and neither publishes comparable benchmark data. The following qualitative comparison situates Oversight's performance characteristics within the broader landscape.
| Property | Oversight (v0.4.4, Python) | C2PA | Digimarc |
|---|---|---|---|
| Primary target | Text documents | Images, video, audio, documents | Images, packaging, audio |
| Watermark type | Multi-layer (formatting + semantic) | None (metadata-only provenance) | Spatial-domain imperceptible marks |
| Survives metadata stripping | Yes (L3 is in content words) | No (provenance is metadata) | Yes (marks are in pixel data) |
| Cryptographic sealing | AEAD encryption + signature | Signature chain (no encryption) | No cryptographic sealing |
| Transparency logging | RFC 6962 Merkle tree | Cloud-based verification | Proprietary registry |
| Open specification | Yes (Apache 2.0) | Yes (open standard) | No (proprietary) |
| Recipient binding | Per-recipient encryption + watermark | No per-recipient binding | No per-recipient binding |
| Post-quantum readiness | ML-KEM-768 + ML-DSA-65 (hybrid suite) | Not specified | Not specified |
C2PA and Oversight address related but distinct problems. C2PA provides a provenance chain for digital media, recording the editing history through cryptographic signatures on metadata. It does not watermark the content itself, and all provenance data is lost when metadata is stripped. Oversight embeds attribution signals directly into the content, providing forensic traceability that survives metadata removal. The two systems are potentially complementary: a C2PA provenance chain could wrap an Oversight-sealed document, providing both editing-history provenance and per-recipient attribution.
Digimarc operates in a fundamentally different domain (spatial-domain marks in images and packaging) and is not directly comparable to a text watermarking system. Its marks survive printing and scanning, a capability that is analogous to Oversight's L3 surviving OCR. Both systems face the same fundamental limitation: marks embedded in the content survive content-preserving transformations but not content-altering ones (cropping for images, paraphrasing for text).
Performance measurements from Oversight v0.4.4 on Intel Core i7, CPython 3.14.2, Windows 10.
The benchmark script (bench_usenix.py) and full raw data are available in the repository.