Zion Boggan · April 2026 · Oversight Protocol v0.4.5

Overview

Oversight is structured as a pipeline: plaintext enters, gets watermarked per-recipient, encrypted to that recipient's public key, timestamped by an independent authority, logged in a Merkle tree, and emitted as a binary .sealed container. The reverse operation (open) parses the container, verifies the signature and timestamp, enforces policy, decrypts, and returns the plaintext with watermark metadata intact for later forensic recovery.

Two implementations exist in parallel. The Python reference (5,689 LOC) prioritizes correctness and readability. The Rust port (2,934 LOC across 9 crates) prioritizes performance and memory safety. Given the same inputs, including fixed keys and nonces, both produce bit-identical output, which is verified by three cross-language conformance tests. The two implementations share no code; they are written independently against the same specification.

The Seal Pipeline

Plaintext → Watermark (L1/L2/L3) → Sign Manifest → Derive & Wrap DEK → AEAD Encrypt → Timestamp → Log → .sealed

The seal operation begins with optional watermark embedding. If watermarking is requested, each recipient's copy gets a unique mark_id derived from their public key fingerprint. Layer 1 inserts zero-width Unicode characters at deterministic positions. Layer 2 encodes attribution bits in trailing whitespace. Layer 3 performs synonym rotation against a 151-class controlled vocabulary, producing marks that survive retyping and format conversion.
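The Layer 1 idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two zero-width code points, the rule of placing one mark after each space, and the helper names mark_bits, embed_l1, extract_l1, and strip_l1 are hypothetical and are not taken from watermark.py.

```python
import hashlib

ZW0 = "\u200b"  # zero-width space, encodes bit 0 (assumed alphabet)
ZW1 = "\u200c"  # zero-width non-joiner, encodes bit 1 (assumed alphabet)


def mark_bits(fingerprint: str, n_bits: int = 32) -> list[int]:
    """Derive a hypothetical per-recipient bit string from a key fingerprint."""
    digest = hashlib.sha256(fingerprint.encode("utf-8")).digest()
    return [(digest[i // 8] >> (7 - (i % 8))) & 1 for i in range(n_bits)]


def embed_l1(text: str, bits: list[int]) -> str:
    """Insert one zero-width character after each of the first len(bits) spaces."""
    out, i = [], 0
    for ch in text:
        out.append(ch)
        if ch == " " and i < len(bits):
            out.append(ZW1 if bits[i] else ZW0)
            i += 1
    return "".join(out)


def extract_l1(text: str) -> list[int]:
    """Read the embedded bits back in document order."""
    return [1 if ch == ZW1 else 0 for ch in text if ch in (ZW0, ZW1)]


def strip_l1(text: str) -> str:
    """Remove the marks, recovering the visible text."""
    return text.replace(ZW0, "").replace(ZW1, "")
```

Because the positions depend only on the text and the bit string, the same recipient always receives the same marked copy, which is what makes later extraction deterministic.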

After watermarking, the caller constructs a Manifest containing the file ID, issuer fingerprint, recipient fingerprint, watermark mark_id, beacon tokens, and policy constraints. The manifest is serialized as canonical JSON (sorted keys, no optional whitespace) and signed with the issuer's Ed25519 key.
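A short sketch of the canonicalization step, using only the Python standard library. The field values are made up, and the choice of UTF-8 output with ensure_ascii=False is an assumption rather than something the specification text above states. Signing is omitted; an Ed25519 signature would be computed over the bytes that canonical_json returns.

```python
import hashlib
import json


def canonical_json(manifest: dict) -> bytes:
    """Serialize with sorted keys and no optional whitespace."""
    return json.dumps(
        manifest,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,  # assumption: emit UTF-8 rather than \u escapes
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """SHA-256 over the canonical bytes, hex-encoded."""
    return hashlib.sha256(canonical_json(manifest)).hexdigest()


# Two dicts with the same content but different insertion order
# serialize to the same bytes, so they hash and sign identically.
a = {"file_id": "f-1", "issuer": "aa", "recipient": "bb"}
b = {"recipient": "bb", "file_id": "f-1", "issuer": "aa"}
same = canonical_json(a) == canonical_json(b)
```

Canonical serialization matters because a signature covers bytes, not structure: without a fixed key order and whitespace rule, two implementations could sign the same manifest and disagree on verification.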

A random 256-bit document encryption key (DEK) is generated. The seal operation performs an X25519 key agreement between an ephemeral keypair and the recipient's public key, derives a wrapping key via HKDF-SHA256, and encrypts the DEK under that wrapping key. In the hybrid suite, ML-KEM-768 encapsulation runs alongside X25519 and both shared secrets feed into the HKDF derivation.
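The derivation step can be sketched with a standard-library HKDF-SHA256 (RFC 5869). The concatenation order of the two shared secrets, the empty salt, and the info label are assumptions chosen for illustration, not the protocol's actual parameters; wrapping_key is a hypothetical helper. The key agreement and the AEAD wrap of the DEK are not shown.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF per RFC 5869: extract a PRK, then expand it to `length` bytes."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long for HKDF-SHA256")
    # Extract: an empty salt is replaced by HASH_LEN zero bytes.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) || info || i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def wrapping_key(x25519_secret: bytes, mlkem_secret: bytes = b"") -> bytes:
    """Derive a 256-bit wrapping key; the hybrid suite feeds both secrets in."""
    ikm = x25519_secret + mlkem_secret  # assumed ordering
    return hkdf_sha256(ikm, salt=b"", info=b"example-dek-wrap")  # assumed label
```

Feeding both shared secrets into one derivation means the wrapping key stays secret as long as either key agreement remains unbroken, which is the usual motivation for a hybrid suite.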

The watermarked plaintext is encrypted with XChaCha20-Poly1305 using the DEK. The AEAD additional data is the SHA-256 hash of the manifest, binding the ciphertext to its metadata. The sealed container bundles the signed manifest, wrapped DEK (including the ephemeral public key), and ciphertext into a binary format prefixed by the six-byte magic OVSGHT.
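To make the container idea concrete, here is a sketch of one possible framing. Only the six-byte magic OVSGHT comes from the text above; the big-endian 32-bit length prefixes, the section order, and the function names pack_container and parse_container are assumptions, and the real .sealed layout may differ.

```python
import struct

MAGIC = b"OVSGHT"


def pack_container(manifest: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Magic, then three length-prefixed sections (assumed layout)."""
    out = bytearray(MAGIC)
    for section in (manifest, wrapped_dek, ciphertext):
        out += struct.pack(">I", len(section)) + section
    return bytes(out)


def parse_container(blob: bytes) -> tuple[bytes, bytes, bytes]:
    """Strict parse: reject bad magic, truncation, and trailing bytes."""
    if blob[: len(MAGIC)] != MAGIC:
        raise ValueError("bad magic")
    offset, sections = len(MAGIC), []
    for _ in range(3):
        if offset + 4 > len(blob):
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + size > len(blob):
            raise ValueError("truncated section")
        sections.append(blob[offset : offset + size])
        offset += size
    if offset != len(blob):
        raise ValueError("trailing data after ciphertext")
    return sections[0], sections[1], sections[2]
```

Rejecting trailing bytes keeps the parse unambiguous: every byte of the file is accounted for by exactly one section.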

Before delivery, the content hash goes to an RFC 3161 timestamp authority (FreeTSA primary, DigiCert fallback) and the seal event is appended to the Merkle transparency log. The timestamp token and inclusion proof are recorded in the manifest.

The Open Pipeline

.sealed → Parse → Verify Signature → Check Policy → Unwrap DEK → Decrypt → Verify Hash → Plaintext

The open operation parses the binary container, extracts the manifest, and verifies the issuer's Ed25519 signature. If a trusted-issuer list is provided, the manifest's issuer fingerprint is checked against it, rejecting seals from unknown issuers.

Policy enforcement runs before decryption. The client checks the not_before and not_after time windows against the system clock, evaluates jurisdiction constraints via IP geolocation, and atomically increments the open counter that is compared against max_opens (an advisory file lock plus an atomic rename prevents TOCTOU races). If any policy check fails, decryption does not proceed and the violation is logged.
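The lock-then-rename pattern for the open counter can be sketched as follows. This is a Unix-only illustration (fcntl is not available on Windows); the state-file layout, the JSON counter format, and the names check_window and record_open are assumptions, and file_id is assumed to be already sanitized for use in a path.

```python
import fcntl
import json
import os
import tempfile
import time
from typing import Optional


def check_window(not_before: float, not_after: float, now: Optional[float] = None) -> bool:
    """True if the current time falls inside the allowed window."""
    now = time.time() if now is None else now
    return not_before <= now <= not_after


def record_open(state_dir: str, file_id: str, max_opens: int) -> bool:
    """Count one open under an exclusive lock; refuse once the limit is reached."""
    os.makedirs(state_dir, exist_ok=True)
    lock_path = os.path.join(state_dir, file_id + ".lock")
    count_path = os.path.join(state_dir, file_id + ".json")
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the lock file closes
        try:
            with open(count_path) as f:
                opens = json.load(f)["opens"]
        except FileNotFoundError:
            opens = 0
        if opens >= max_opens:
            return False
        # Write the new count to a temp file, then rename it over the old one,
        # so a crash never leaves a half-written counter behind.
        fd, tmp_path = tempfile.mkstemp(dir=state_dir)
        with os.fdopen(fd, "w") as f:
            json.dump({"opens": opens + 1}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, count_path)
        return True
```

The read, the comparison, and the write all happen while the lock is held, so two concurrent opens cannot both observe the same count and both succeed.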

The DEK is unwrapped by performing X25519 key agreement between the recipient's private key and the ephemeral public key stored in the container. After HKDF derivation, the DEK decrypts the ciphertext via XChaCha20-Poly1305. A post-decrypt SHA-256 check confirms the content matches the hash in the manifest. The function returns the plaintext and the parsed manifest, with watermark metadata available for forensic extraction if the document later leaks.

Rust Crate Structure

The Rust implementation is organized as a Cargo workspace with 9 crates. Each crate has a single responsibility, and dependencies flow downward (crypto at the bottom, CLI at the top).

| Crate | LOC | Responsibility |
|---|---|---|
| oversight-crypto | 395 | X25519 key agreement, Ed25519 signing, XChaCha20-Poly1305 AEAD, HKDF-SHA256 key derivation. Post-quantum hooks reserved. |
| oversight-manifest | 236 | Canonical JSON manifest: file_id, issuer/recipient fingerprints, watermark mark_id, beacons, policy. Ed25519 signing and verification. |
| oversight-container | 487 | Binary .sealed format: magic bytes, signed manifest, wrapped DEK, AEAD ciphertext. Implements seal() and open_sealed(). |
| oversight-watermark | 210 | L1 (zero-width Unicode) and L2 (trailing whitespace) watermark embedding and extraction. |
| oversight-semantic | 530 | L3 synonym rotation watermarking. 151-class dictionary. Skip regions for URLs, email, code, hex, base64. |
| oversight-tlog | 486 | RFC 6962 Merkle tree. Signed tree heads, inclusion proofs, durable append with fsync, automatic recovery on reopen. |
| oversight-policy | 351 | Policy enforcement: time windows, max_opens (TOCTOU-safe via flock + atomic rename), jurisdiction checks, file-ID sanitization. |
| oversight-rekor | 535 | Sigstore Rekor v2 client. DSSE envelope construction, attestation signing, optional upload to Rekor instance. |
| oversight-cli | 222 | Command-line interface: keygen, seal, open, inspect subcommands. |

The dependency graph is intentionally narrow. oversight-crypto has no in-workspace dependencies. oversight-manifest depends only on oversight-crypto. oversight-container depends on oversight-crypto and oversight-manifest. The watermark and policy crates are independent of the crypto layer. oversight-cli sits at the top and imports everything.

External cryptographic dependencies for the classical suite come from the dalek-cryptography project (x25519-dalek, ed25519-dalek) and from RustCrypto (chacha20poly1305, hkdf, sha2). Post-quantum algorithms use pqcrypto bindings. No custom cryptographic constructions exist anywhere in the codebase.

Python Module Layout

The Python reference implementation lives in oversight_core/ and mirrors the Rust crate structure, though as a flat module layout rather than separate packages.

| Module | LOC | Responsibility |
|---|---|---|
| crypto.py | 337 | All cryptographic operations. Uses the cryptography library (OpenSSL backend) and oqs-python for ML-KEM/ML-DSA. |
| manifest.py | 178 | Manifest dataclass and canonical JSON serialization. Ed25519 signing integrated. |
| container.py | 277 | Binary format encoding and parsing. seal() and open_sealed() entry points. |
| watermark.py | 208 | L1 and L2 watermark insertion and extraction. |
| semantic.py | 496 | L3 synonym rotation. 151-class dictionary with skip regions for non-prose content. |
| l3_policy.py | 214 | L3 safety policy added in v0.4.5. Document-class detection (legal, regulatory, technical, source code, SQL, log, structured, prose), mode selection (auto, off, full, boilerplate), and the acknowledgement gate that blocks body-text rewrites without --l3-ack. |
| cli/gui.py | 212 | Tkinter desktop starter added in v0.4.5. Three-pane workflow: generate identity, seal to recipient, open a sealed file. Invoked via oversight gui or the oversight-gui entry point. Uses only the standard library Tkinter toolkit. |
| tlog.py | 239 | RFC 6962 Merkle tree. Append-only with signed tree heads and inclusion proofs. |
| policy.py | 170 | Policy evaluation: time windows, max_opens, jurisdiction. Supports LOCAL_ONLY, REGISTRY, and HYBRID modes. |
| rekor.py | 425 | Sigstore Rekor v2 DSSE client. Bit-identical behavior to the Rust oversight-rekor crate. |
| beacon.py | 110 | Canary token generation for DNS, HTTP, OCSP, and license-check beacons. |
| timestamp.py | 156 | RFC 3161 timestamp requests. FreeTSA primary, DigiCert fallback, self-signed stub if both fail. |
| decoy.py | 225 | Decoy file generation for misdirection. Structurally mimics legitimate sealed files. |
| formats/ | 429 | Format adapters for text, PDF, DOCX, and image (LSB in Y channel). Each normalizes content for watermarking. |

Attribution Registry

The registry is a FastAPI server (729 LOC in registry/server.py) that stores seal metadata and answers attribution queries. When an issuer seals a document, the client optionally registers the manifest, watermark mark_ids, and beacon token_ids with the registry. Later, if a leaked document surfaces, the attribution pipeline extracts watermark marks from the text and queries the registry to resolve them to a specific recipient.

The registry exposes the surface documented in docs/spec/registry-v1.md. POST /register stores a sealed file's manifest, watermarks, and beacons after verifying the issuer's Ed25519 signature. POST /attribute resolves a token_id, mark_id, or perceptual hash to the responsible recipient. GET /evidence/{file_id} emits a signed bundle carrying the manifest, events, and transparency-log proofs. POST /dns_event ingests beacon callbacks from the DNS server, authenticated with a shared secret. GET /health and GET /.well-known/oversight-registry publish liveness and identity. Every endpoint is exercised by the conformance harness at tests/test_registry_conformance.py; operators of federated registries can point the harness at their own deployments to claim v1 compatibility.

Data lives in SQLite with WAL mode for concurrent reads. Three tables track the core relationships: beacons maps token_id to file/recipient/issuer, watermarks maps mark_id and layer to file/recipient, and manifests stores the canonical manifest JSON with its signature. The registry maintains its own Ed25519 identity for signing log entries and rate-limits API calls with X-Forwarded-For support for reverse-proxy deployments.
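The three-table layout can be sketched with the standard sqlite3 module. The table names and the relationships follow the description above; the column names, types, and primary keys are assumptions, as are the helpers open_registry and attribute_mark, so the real schema in registry/server.py may differ.

```python
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS manifests (
    file_id       TEXT PRIMARY KEY,
    manifest_json TEXT NOT NULL,   -- canonical manifest JSON
    signature     TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS beacons (
    token_id  TEXT PRIMARY KEY,
    file_id   TEXT NOT NULL,
    recipient TEXT NOT NULL,
    issuer    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS watermarks (
    mark_id   TEXT NOT NULL,
    layer     TEXT NOT NULL,
    file_id   TEXT NOT NULL,
    recipient TEXT NOT NULL,
    PRIMARY KEY (mark_id, layer)
);
"""


def open_registry(path: str) -> sqlite3.Connection:
    """Open the database, request WAL journaling, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # has no effect on in-memory databases
    conn.executescript(SCHEMA)
    return conn


def attribute_mark(conn: sqlite3.Connection, mark_id: str) -> Optional[tuple]:
    """Resolve a recovered mark_id to (file_id, recipient), or None if unknown."""
    return conn.execute(
        "SELECT file_id, recipient FROM watermarks WHERE mark_id = ? LIMIT 1",
        (mark_id,),
    ).fetchone()
```

Attribution then reduces to a keyed lookup: the forensic pipeline recovers a mark_id from leaked text and the registry maps it back to the recipient recorded at seal time.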

DNS Beacon Server

The DNS beacon server (oversight_dns/server.py, 142 LOC) is an authoritative nameserver for a beacon domain. When a sealed document contains a DNS beacon, the embedded token fires a DNS lookup against <token_id>.t.<beacon_domain> during document preview. The nameserver logs the query and forwards it to the registry via POST /dns_event, attributing the open event to the recipient whose token_id matches.
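Extracting the token from a query name of the form <token_id>.t.<beacon_domain> can be sketched as a small string function. The name token_from_qname is hypothetical, and the sketch assumes token IDs are a single DNS label and compare case-insensitively, since resolvers may alter the case of a query name.

```python
from typing import Optional


def token_from_qname(qname: str, beacon_domain: str) -> Optional[str]:
    """Return the token_id label, or None if the name is not a beacon query."""
    name = qname.rstrip(".").lower()
    suffix = ".t." + beacon_domain.rstrip(".").lower()
    if not name.endswith(suffix):
        return None
    token = name[: -len(suffix)]
    # Require exactly one non-empty label in front of ".t.<beacon_domain>".
    if not token or "." in token:
        return None
    return token
```

For example, with a beacon domain of beacon.example, the query abc123.t.beacon.example. yields abc123, while www.beacon.example yields None and would not be forwarded as an open event.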

DNS beacons fire before HTTP beacons in most document-preview pipelines, because a tool must resolve a linked resource's hostname before it can fetch the resource. This gives DNS beacons the earliest detection window. The server answers every query with a generic A record to satisfy the resolver, regardless of whether the token_id is valid.

The registry-side POST /dns_event endpoint is hardened in both the Python and Rust registries. Non-loopback callbacks must carry a shared secret set through the OVERSIGHT_DNS_EVENT_SECRET environment variable, and requests that arrive without it fail closed instead of recording unattributed open events. Loopback callers (the DNS server running on the same host as the registry) are trusted without the secret. The Rust oversight-registry crate mirrors the Python semantics so a federated registry operator inherits the same posture.

Absence of a beacon is not evidence of no leak; beacons are forensic telemetry, and corporate egress filtering, air-gapped readers, or sandboxed previews can suppress them. The threat model document at research/threat-model.html states this limit explicitly.
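The fail-closed rule for POST /dns_event described above can be sketched as a single predicate. Only the environment variable name comes from the text; the function name dns_event_allowed and the way the secret is presented are assumptions, and how the real registries transport the secret is not shown here.

```python
import hmac
import ipaddress
import os
from typing import Optional


def dns_event_allowed(
    client_ip: str, presented: Optional[str], configured: Optional[str] = None
) -> bool:
    """Loopback callers pass; everyone else needs the configured shared secret."""
    if configured is None:
        configured = os.environ.get("OVERSIGHT_DNS_EVENT_SECRET")
    if ipaddress.ip_address(client_ip).is_loopback:
        return True
    # Fail closed: no configured secret, or none presented, means reject.
    if not configured or presented is None:
        return False
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(presented.encode("utf-8"), configured.encode("utf-8"))
```

One design point to keep in mind with a check like this: behind a reverse proxy the peer address is the proxy's, so a deployment has to decide deliberately which address counts as the client before trusting loopback.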

Cross-Language Conformance

Three conformance tests verify that the Python and Rust implementations produce identical output for the same inputs. Each test uses fixed test vectors with predetermined keys, nonces, and plaintext, eliminating randomness. The sealed container bytes from Python are compared against the sealed container bytes from Rust. Any divergence fails the test.

The conformance tests caught two real bugs during development. The first was an RFC 6962 Merkle tree split algorithm mismatch: the Python implementation used a promote-odd-trailing shortcut that differed from the canonical largest-power-of-2 left-heavy split. The second was a synonym round-trip bug where L3 embed could select hyphenated variants that the tokenizer later split into separate words, desyncing the verification sequence. Both were fixed in v0.4.0.
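For reference, the canonical RFC 6962 tree hash with its largest-power-of-two split can be sketched in a few lines. This is an illustration of the RFC's definition, not the code in tlog.py or oversight-tlog, and it computes only the root (no signed tree heads or inclusion proofs).

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over a 0x00 prefix and the entry."""
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior hash: SHA-256 over a 0x01 prefix and both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def split_point(n: int) -> int:
    """Largest power of two strictly less than n, for n >= 2."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: list[bytes]) -> bytes:
    """Merkle tree hash: the left subtree always takes split_point(n) leaves."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = split_point(len(leaves))
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))
```

The distinct 0x00 and 0x01 prefixes keep leaf and interior hashes in separate domains, so an interior node can never be passed off as a leaf.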

What's Shipped and What's Next

v0.4.5 (shipped): L3 semantic watermark safety via l3_policy.py, manifest fields canonical_content_hash and l3_policy for dispute resolution and audit, Tkinter GUI starter (oversight gui), hardened GUI/CLI output writes, strict container parsing for suite-byte and trailing-data tamper, and a public threat model at research/threat-model.html. PyPI and Rust dependency floors were raised after a Dependabot follow-up.

v0.4.4 (shipped): Fail-closed hardening across nine security findings. Full writeup in the blog post Nine Security Findings and the Discipline of Failing Closed.

v0.5 (shipped): Replaced the self-hosted Merkle tree with a Sigstore Rekor v2 backend. Seal entries are wrapped in DSSE envelopes and submitted to rekor.sigstore.dev or a self-hosted deployment. The oversight-rekor crate and rekor.py module are live in the codebase, and the bundle format is sigstore-compatible (bundle_schema: 2).

v0.6 (shipped): Format adapters (text, PDF, DOCX, image LSB) ported from Python to Rust, producing a statically-linked binary that handles all supported document types without a Python runtime.

v1.0 (in progress): FastAPI registry port to Rust using Axum and SQLx (1,125 LOC, all endpoints implemented, #![forbid(unsafe_code)]). Registry federation spec (docs/spec/registry-v1.md) is a v1.0 prerequisite. Wire format stability declaration and specification freeze are the gating items for external implementors.


This document describes the architecture as of v0.4.5. The codebase is actively developed; consult the repository for the latest structure.