Architecture Overview
Component structure, data flow, and design rationale for Oversight v0.4.5
Zion Boggan · April 2026 · Oversight Protocol v0.4.5
Overview
Oversight is structured as a pipeline: plaintext enters, gets watermarked per-recipient,
encrypted to that recipient's public key, timestamped by an independent authority, logged
in a Merkle tree, and emitted as a binary .sealed container. The reverse operation
(open) parses the container, verifies the signature and timestamp, enforces policy, decrypts,
and returns the plaintext with watermark metadata intact for later forensic recovery.
Two implementations exist in parallel. The Python reference (5,689 LOC) prioritizes correctness and readability. The Rust port (2,934 LOC across 9 crates) prioritizes performance and memory safety. Both produce bit-identical output for the same inputs, verified by 3 cross-language conformance tests. The two implementations share no code; they are independently written against the same specification.
The Seal Pipeline
The seal operation begins with optional watermark embedding. If watermarking is requested,
each recipient's copy gets a unique mark_id derived from their public key
fingerprint. Layer 1 inserts zero-width Unicode characters at deterministic positions.
Layer 2 encodes attribution bits in trailing whitespace. Layer 3 performs synonym rotation
against a 151-class controlled vocabulary, producing marks that survive retyping and
format conversion.
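As an illustration of the Layer 1 idea only, the sketch below hides the bits of a mark after successive spaces using two zero-width code points. The bit-to-character mapping, the choice of spaces as the deterministic positions, and the function names embed_l1, extract_l1, and strip_l1 are assumptions made for this example, not the scheme Oversight actually uses.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE stands for bit 0 (assumed mapping)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER stands for bit 1 (assumed mapping)


def to_bits(mark_id: bytes) -> list:
    """Expand bytes into a most-significant-bit-first list of 0/1 values."""
    return [(byte >> shift) & 1 for byte in mark_id for shift in range(7, -1, -1)]


def embed_l1(text: str, mark_id: bytes) -> str:
    """Insert one zero-width character after each space until all bits are placed."""
    bits = to_bits(mark_id)
    out, placed = [], 0
    for ch in text:
        out.append(ch)
        if ch == " " and placed < len(bits):
            out.append(ONE if bits[placed] else ZERO)
            placed += 1
    if placed < len(bits):
        raise ValueError("text has too few spaces to carry the mark")
    return "".join(out)


def extract_l1(text: str, n_bytes: int) -> bytes:
    """Collect the zero-width characters in order and rebuild the mark bytes."""
    bits = [1 if ch == ONE else 0 for ch in text if ch in (ZERO, ONE)]
    if len(bits) < n_bytes * 8:
        raise ValueError("not enough embedded bits")
    out = bytearray()
    for start in range(0, n_bytes * 8, 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def strip_l1(text: str) -> str:
    """Remove the embedded characters, recovering the visible text."""
    return text.replace(ZERO, "").replace(ONE, "")
```

A mark embedded this way is invisible when rendered but is lost as soon as the text is retyped, which is why the pipeline layers the synonym-based Layer 3 on top.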
After watermarking, the caller constructs a Manifest containing the file ID,
issuer fingerprint, recipient fingerprint, watermark mark_id, beacon tokens, and policy
constraints. The manifest is serialized as canonical JSON (sorted keys, no optional
whitespace) and signed with the issuer's Ed25519 key.
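The canonicalization step can be sketched with the standard library alone. The field names in the usage note are hypothetical, and the choice of UTF-8 output with ensure_ascii=False is an assumption; the Ed25519 signature itself is left out because it needs a third-party library.

```python
import hashlib
import json


def canonical_json(manifest: dict) -> bytes:
    """Serialize with sorted keys and no optional whitespace."""
    return json.dumps(
        manifest,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,  # assumption: emit UTF-8 rather than \u escapes
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> bytes:
    """SHA-256 over the canonical bytes, stable across key ordering."""
    return hashlib.sha256(canonical_json(manifest)).digest()
```

Because the bytes do not depend on insertion order, two manifests such as {"file_id": "f1", "policy": {"max_opens": 3}} and {"policy": {"max_opens": 3}, "file_id": "f1"} serialize identically, so a signature computed over one verifies against the other.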
A random 256-bit document encryption key (DEK) is generated. The seal operation performs an X25519 key agreement between an ephemeral keypair and the recipient's public key, derives a wrapping key via HKDF-SHA256, and encrypts the DEK under that wrapping key. In the hybrid suite, ML-KEM-768 encapsulation runs alongside X25519 and both shared secrets feed into the HKDF derivation.
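The derivation step can be sketched with HKDF-SHA256 as defined in RFC 5869, built here from the standard library's hmac module. How the two shared secrets are combined in the hybrid suite is not stated above; the plain concatenation, the empty salt, and the info label in derive_wrapping_key are assumptions for illustration only.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """RFC 5869 extract step: PRK = HMAC(salt, IKM)."""
    if not salt:
        salt = b"\x00" * HASH_LEN
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """RFC 5869 expand step: chain HMAC blocks until `length` bytes exist."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_wrapping_key(x25519_secret: bytes, mlkem_secret: bytes = b"") -> bytes:
    """Derive a 256-bit wrapping key from one or two shared secrets.

    Assumption: the hybrid suite concatenates the secrets before extraction.
    """
    ikm = x25519_secret + mlkem_secret
    prk = hkdf_extract(b"", ikm)
    return hkdf_expand(prk, b"example-dek-wrap", 32)  # hypothetical info label
```

In production code the same derivation would normally come from a vetted library rather than a hand-rolled helper; the point here is only to show that the wrapping key changes whenever either shared secret changes.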
The watermarked plaintext is encrypted with XChaCha20-Poly1305 using the DEK. The AEAD
additional data is the SHA-256 hash of the manifest, binding the ciphertext to its metadata.
The sealed container bundles the signed manifest, wrapped DEK (including the ephemeral
public key), and ciphertext into a binary format prefixed by the six-byte magic
OVSGHT.
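To make the bundling concrete, here is a minimal sketch of a length-prefixed container. Only the six-byte magic comes from the description above; the suite byte values, the field order, and the big-endian 32-bit length prefixes are a hypothetical layout, not the actual .sealed wire format.

```python
import struct

MAGIC = b"OVSGHT"
KNOWN_SUITES = {1, 2}  # hypothetical suite identifiers


def pack_container(suite: int, manifest: bytes, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    """Concatenate magic, suite byte, and three length-prefixed fields."""
    out = bytearray(MAGIC)
    out.append(suite)
    for part in (manifest, wrapped_dek, ciphertext):
        out += struct.pack(">I", len(part)) + part
    return bytes(out)


def parse_container(blob: bytes):
    """Parse strictly: reject bad magic, unknown suites, truncation, trailing bytes."""
    if blob[:6] != MAGIC:
        raise ValueError("bad magic")
    if len(blob) < 7 or blob[6] not in KNOWN_SUITES:
        raise ValueError("unknown suite")
    offset, parts = 7, []
    for _ in range(3):
        if offset + 4 > len(blob):
            raise ValueError("truncated length prefix")
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + size > len(blob):
            raise ValueError("truncated field")
        parts.append(blob[offset:offset + size])
        offset += size
    if offset != len(blob):
        raise ValueError("trailing data")
    return blob[6], parts[0], parts[1], parts[2]
```

Rejecting anything left over after the last field means a parser never silently accepts bytes that the signature and AEAD tag do not cover.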
Before delivery, the content hash goes to an RFC 3161 timestamp authority (FreeTSA primary, DigiCert fallback) and the seal event is appended to the Merkle transparency log. The timestamp token and inclusion proof are recorded in the manifest.
The Open Pipeline
The open operation parses the binary container, extracts the manifest, and verifies the issuer's Ed25519 signature. If a trusted-issuer list is provided, the manifest's issuer fingerprint is checked against it, rejecting seals from unknown issuers.
Policy enforcement runs before decryption. The client checks not_before
and not_after time windows against the system clock, evaluates jurisdiction
constraints via IP geolocation, and increments the max_opens counter atomically
(advisory file lock plus atomic rename to prevent TOCTOU races). If any policy check fails,
decryption does not proceed and the violation is logged.
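The time-window check and the counter update can be sketched as follows. This is a POSIX-only illustration (it relies on fcntl), and the counter file format, the separate .lock file, and the function names are assumptions rather than the layout the policy module uses.

```python
import fcntl
import os
import tempfile
import time


def check_time_window(not_before, not_after, now=None):
    """Raise PermissionError when `now` falls outside [not_before, not_after]."""
    now = time.time() if now is None else now
    if not_before is not None and now < not_before:
        raise PermissionError("not yet valid")
    if not_after is not None and now > not_after:
        raise PermissionError("expired")


def increment_opens(counter_path, max_opens):
    """Bump the open counter under a lock, refusing once max_opens is reached."""
    directory = os.path.dirname(counter_path) or "."
    with open(counter_path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # released when the lock file closes
        try:
            with open(counter_path) as f:
                count = int(f.read().strip() or "0")
        except FileNotFoundError:
            count = 0
        if count >= max_opens:
            raise PermissionError("max_opens exceeded")
        count += 1
        # Write the new value to a temp file, then rename it into place so
        # readers never observe a partially written counter.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp:
            tmp.write(str(count))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, counter_path)
        return count
```

Holding the lock across the read, the comparison, and the rename is what closes the window in which two concurrent opens could both see the same count.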
The DEK is unwrapped by performing X25519 key agreement between the recipient's private key and the ephemeral public key stored in the container. After HKDF derivation, the DEK decrypts the ciphertext via XChaCha20-Poly1305. A post-decrypt SHA-256 check confirms the content matches the hash in the manifest. The function returns the plaintext and the parsed manifest, with watermark metadata available for forensic extraction if the document later leaks.
Rust Crate Structure
The Rust implementation is organized as a Cargo workspace with 9 crates. Each crate has a single responsibility, and dependencies flow downward (crypto at the bottom, CLI at the top).
| Crate | LOC | Responsibility |
|---|---|---|
| oversight-crypto | 395 | X25519 key agreement, Ed25519 signing, XChaCha20-Poly1305 AEAD, HKDF-SHA256 key derivation. Post-quantum hooks reserved. |
| oversight-manifest | 236 | Canonical JSON manifest: file_id, issuer/recipient fingerprints, watermark mark_id, beacons, policy. Ed25519 signing and verification. |
| oversight-container | 487 | Binary .sealed format: magic bytes, signed manifest, wrapped DEK, AEAD ciphertext. Implements seal() and open_sealed(). |
| oversight-watermark | 210 | L1 (zero-width unicode) and L2 (trailing whitespace) watermark embedding and extraction. |
| oversight-semantic | 530 | L3 synonym rotation watermarking. 151-class dictionary. Skip regions for URLs, email, code, hex, base64. |
| oversight-tlog | 486 | RFC 6962 Merkle tree. Signed tree heads, inclusion proofs, durable append with fsync, automatic recovery on reopen. |
| oversight-policy | 351 | Policy enforcement: time windows, max_opens (TOCTOU-safe via flock + atomic rename), jurisdiction checks, file-ID sanitization. |
| oversight-rekor | 535 | Sigstore Rekor v2 client. DSSE envelope construction, attestation signing, optional upload to Rekor instance. |
| oversight-cli | 222 | Command-line interface: keygen, seal, open, inspect subcommands. |
The dependency graph is intentionally narrow. oversight-crypto has no
in-workspace dependencies. oversight-manifest depends only on
oversight-crypto. oversight-container depends on
oversight-crypto and oversight-manifest. The watermark and
policy crates are independent of the crypto layer. oversight-cli sits at the
top and imports everything.
External cryptographic dependencies use RustCrypto crates (x25519-dalek,
ed25519-dalek, chacha20poly1305, hkdf, sha2)
for the classical suite. Post-quantum algorithms use pqcrypto bindings.
No custom cryptographic constructions exist anywhere in the codebase.
Python Module Layout
The Python reference implementation lives in oversight_core/ and mirrors the
Rust crate structure, though as a flat module layout rather than separate packages.
| Module | LOC | Responsibility |
|---|---|---|
| crypto.py | 337 | All cryptographic operations. Uses the cryptography library (OpenSSL backend) and oqs-python for ML-KEM/ML-DSA. |
| manifest.py | 178 | Manifest dataclass and canonical JSON serialization. Ed25519 signing integrated. |
| container.py | 277 | Binary format encoding and parsing. seal() and open_sealed() entry points. |
| watermark.py | 208 | L1 and L2 watermark insertion and extraction. |
| semantic.py | 496 | L3 synonym rotation. 151-class dictionary with skip regions for non-prose content. |
| l3_policy.py | 214 | L3 safety policy added in v0.4.5. Document-class detection (legal, regulatory, technical, source code, SQL, log, structured, prose), mode selection (auto, off, full, boilerplate), and the acknowledgement gate that blocks body-text rewrites without --l3-ack. |
| cli/gui.py | 212 | Tkinter desktop starter added in v0.4.5. Three-pane workflow: generate identity, seal to recipient, open a sealed file. Invoked via oversight gui or the oversight-gui entry point. Uses only the standard library Tkinter toolkit. |
| tlog.py | 239 | RFC 6962 Merkle tree. Append-only with signed tree heads and inclusion proofs. |
| policy.py | 170 | Policy evaluation: time windows, max_opens, jurisdiction. Supports LOCAL_ONLY, REGISTRY, and HYBRID modes. |
| rekor.py | 425 | Sigstore Rekor v2 DSSE client. Bit-identical behavior to the Rust oversight-rekor crate. |
| beacon.py | 110 | Canary token generation for DNS, HTTP, OCSP, and license-check beacons. |
| timestamp.py | 156 | RFC 3161 timestamp requests. FreeTSA primary, DigiCert fallback, self-signed stub if both fail. |
| decoy.py | 225 | Decoy file generation for misdirection. Structurally mimics legitimate sealed files. |
| formats/ | 429 | Format adapters for text, PDF, DOCX, and image (LSB in Y channel). Each normalizes content for watermarking. |
Attribution Registry
The registry is a FastAPI server (729 LOC in registry/server.py) that stores
seal metadata and answers attribution queries. When an issuer seals a document, the client
optionally registers the manifest, watermark mark_ids, and beacon token_ids with the registry.
Later, if a leaked document surfaces, the attribution pipeline extracts watermark marks from
the text and queries the registry to resolve them to a specific recipient.
The registry exposes the surface documented in
docs/spec/registry-v1.md.
POST /register stores a sealed file's manifest, watermarks, and beacons after
verifying the issuer's Ed25519 signature. POST /attribute resolves a
token_id, mark_id, or perceptual hash to the responsible
recipient. GET /evidence/{file_id} emits a signed bundle carrying the
manifest, events, and transparency-log proofs. POST /dns_event ingests
beacon callbacks from the DNS server, authenticated with a shared secret. GET /health
and GET /.well-known/oversight-registry publish liveness and identity.
Every endpoint is exercised by the conformance harness at
tests/test_registry_conformance.py, which federated operators point at
their own deployments to claim v1 compatibility.
Data lives in SQLite with WAL mode for concurrent reads. Three tables track the core
relationships: beacons maps token_id to file/recipient/issuer,
watermarks maps mark_id and layer to file/recipient, and
manifests stores the canonical manifest JSON with its signature.
The registry maintains its own Ed25519 identity for signing log entries and
rate-limits API calls with X-Forwarded-For support for reverse-proxy deployments.
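A minimal sketch of the three-table layout using the standard sqlite3 module follows. The table names and the keys they map come from the description above; the column names, types, and primary keys are assumptions, and attribute_mark is a hypothetical helper rather than the registry's actual query.

```python
import sqlite3

# Hypothetical schema: only the table names and key relationships are from the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS manifests (
    file_id TEXT PRIMARY KEY,
    manifest_json TEXT NOT NULL,
    signature TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS beacons (
    token_id TEXT PRIMARY KEY,
    file_id TEXT NOT NULL,
    recipient TEXT NOT NULL,
    issuer TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS watermarks (
    mark_id TEXT NOT NULL,
    layer TEXT NOT NULL,
    file_id TEXT NOT NULL,
    recipient TEXT NOT NULL,
    PRIMARY KEY (mark_id, layer)
);
"""


def open_registry(path: str) -> sqlite3.Connection:
    """Open the database, request WAL journaling, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # in-memory databases ignore this
    conn.executescript(SCHEMA)
    return conn


def attribute_mark(conn: sqlite3.Connection, mark_id: str):
    """Resolve a recovered mark_id to (file_id, recipient), or None if unknown."""
    return conn.execute(
        "SELECT file_id, recipient FROM watermarks WHERE mark_id = ? LIMIT 1",
        (mark_id,),
    ).fetchone()
```

WAL mode lets attribution queries read while a registration is being written, which matches the read-heavy access pattern described above.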
DNS Beacon Server
The DNS beacon server (oversight_dns/server.py, 142 LOC) is an authoritative
nameserver for a beacon domain. When a sealed document contains a DNS beacon, the embedded
token fires a DNS lookup against <token_id>.t.<beacon_domain>
during document preview. The nameserver logs the query and forwards it to the registry
via POST /dns_event, attributing the open event to the recipient whose
token_id matches.
DNS beacons fire before HTTP in most document-preview pipelines, because a tool must resolve a linked resource's hostname before it can fetch the resource. This gives DNS beacons the earliest detection window. The server answers every query with a generic A record to satisfy the resolver, regardless of whether the token_id is valid.
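Recovering the token from a query name of the form <token_id>.t.<beacon_domain> can be sketched as below. The function name, the case folding, and the rule that a token is a single DNS label are assumptions for this example; beacon.example is a placeholder domain.

```python
def extract_token_id(qname: str, beacon_domain: str):
    """Return the token label from <token_id>.t.<beacon_domain>, else None."""
    name = qname.rstrip(".").lower()
    suffix = ".t." + beacon_domain.rstrip(".").lower()
    if not name.endswith(suffix):
        return None
    token = name[: -len(suffix)]
    # Assumption: a token is exactly one non-empty DNS label.
    if not token or "." in token:
        return None
    return token
```

For instance, extract_token_id("abc123.t.beacon.example.", "beacon.example") yields "abc123". A None result would only mean that no event is forwarded to the registry; the generic A record is returned either way.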
The registry-side POST /dns_event endpoint is hardened in both the Python
and Rust registries. Non-loopback callbacks must carry a shared secret set through the
OVERSIGHT_DNS_EVENT_SECRET environment variable, and requests that arrive
without it fail closed instead of recording unattributed open events. Loopback callers
(the DNS server running on the same host as the registry) are trusted without the
secret. The Rust oversight-registry crate mirrors the Python semantics so
a federated registry operator inherits the same posture. Absence of a beacon is not
evidence of no leak; beacons are forensic telemetry, and corporate egress filtering,
air-gapped readers, or sandboxed previews can suppress them. The threat model document
at research/threat-model.html states this
limit explicitly.
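The fail-closed rule for POST /dns_event described above can be sketched as a small predicate. The function name and the argument shapes are hypothetical; in a deployment the configured secret would be read from OVERSIGHT_DNS_EVENT_SECRET, and the remote address would be whatever peer address the server has already determined.

```python
import hmac
import ipaddress


def dns_event_allowed(remote_addr, presented_secret, configured_secret) -> bool:
    """Decide whether a /dns_event callback may be recorded."""
    try:
        if ipaddress.ip_address(remote_addr).is_loopback:
            return True  # same-host DNS server is trusted without the secret
    except ValueError:
        return False  # unparseable peer address: fail closed
    if not configured_secret or presented_secret is None:
        return False  # no secret configured or none presented: fail closed
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(
        presented_secret.encode("utf-8"), configured_secret.encode("utf-8")
    )
```

Note that a non-loopback caller is rejected even when the operator forgot to set the secret, so a misconfigured registry drops events instead of accepting unauthenticated ones.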
Cross-Language Conformance
Three conformance tests verify that the Python and Rust implementations produce identical output for the same inputs. Each test uses fixed test vectors with predetermined keys, nonces, and plaintext, eliminating randomness. The sealed container bytes from Python are compared against the sealed container bytes from Rust. Any divergence fails the test.
The conformance tests caught two real bugs during development. The first was an RFC 6962 Merkle tree split algorithm mismatch: the Python implementation used a promote-odd-trailing shortcut that differed from the canonical largest-power-of-2 left-heavy split. The second was a synonym round-trip bug where L3 embed could select hyphenated variants that the tokenizer later split into separate words, desyncing the verification sequence. Both were fixed in v0.4.0.
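For reference, the canonical RFC 6962 tree hash with its largest-power-of-two split can be sketched as follows. This is an illustrative recursive version written from the RFC's definition, with hypothetical function names; it is not taken from either implementation.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256 over a 0x00 prefix plus the entry."""
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior hash: SHA-256 over a 0x01 prefix plus both children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def split_point(n: int) -> int:
    """Largest power of two strictly less than n, for n >= 2."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k


def merkle_root(leaves: list) -> bytes:
    """Merkle tree hash over an ordered list of byte-string entries."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    k = split_point(n)  # the left subtree always takes the first k entries
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))
```

With five entries the split is 4 and 1, so the left subtree is a full four-leaf tree and the fifth entry hangs directly off the root. Two implementations only agree on roots and inclusion proofs if they agree on this split at every level.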
What's Shipped and What's Next
v0.4.5 (shipped): L3 semantic watermark safety via
l3_policy.py, manifest fields canonical_content_hash and
l3_policy for dispute resolution and audit, Tkinter GUI starter
(oversight gui), hardened GUI/CLI output writes, strict container
parsing for suite-byte and trailing-data tamper, and a public threat model at
research/threat-model.html. PyPI
and Rust dependency floors were raised after a Dependabot follow-up.
v0.4.4 (shipped): Fail-closed hardening across nine security findings. Full writeup in the blog post Nine Security Findings and the Discipline of Failing Closed.
v0.5 (shipped): Replaced the self-hosted Merkle tree with a Sigstore
Rekor v2 backend. Seal entries are wrapped in DSSE envelopes and submitted to
rekor.sigstore.dev or a self-hosted deployment. The
oversight-rekor crate and rekor.py module are live in the
codebase, and the bundle format is sigstore-compatible (bundle_schema: 2).
v0.6 (shipped): Format adapters (text, PDF, DOCX, image LSB) ported from Python to Rust, producing a statically-linked binary that handles all supported document types without a Python runtime.
v1.0 (in progress): FastAPI registry port to Rust using Axum and
SQLx (1,125 LOC, all endpoints implemented, #![forbid(unsafe_code)]).
Registry federation spec
(docs/spec/registry-v1.md)
is a v1.0 prerequisite. Wire format stability declaration and specification freeze are
the gating items for external implementors.
This document describes the architecture as of v0.4.5. The codebase is actively developed; consult the repository for the latest structure.