Zion Boggan · April 2026 · Oversight Protocol v0.4.5

The oversight_core package exposes the full seal/open pipeline, watermark embedding and recovery, semantic marking, content fingerprinting, error-correcting codes, manifest construction, and cryptographic primitives. All public symbols are importable from the top-level package or from their respective submodules. This reference covers the Python implementation; the Rust API mirrors these interfaces but uses native types and ownership semantics.

Contents

container (seal, open_sealed) · watermark (L1/L2 marks, fusion) · semantic (L3 marks) · fingerprint (content identification) · ecc (error correction) · manifest (metadata binding) · crypto (primitives)

container

The container module implements the binary .sealed format and exposes the two primary entry points for the protocol: seal() and open_sealed(). It also provides the SealedFile dataclass for low-level access to the container fields.

seal()

| Parameter | Type | Description |
| --- | --- | --- |
| plaintext | bytes | Raw content to encrypt. Watermarking, if desired, must be applied before calling seal. |
| manifest | Manifest | Pre-populated manifest. content_hash must equal sha256(plaintext) and size_bytes must equal len(plaintext). |
| issuer_ed25519_priv | bytes (32) | Issuer's Ed25519 private key seed for signing the manifest. |
| recipient_x25519_pub | bytes (32) | Recipient's X25519 public key. Must match manifest.recipient.x25519_pub. |

Returns: bytes containing the complete .sealed binary blob.

The function signs the manifest with Ed25519, generates a random 256-bit DEK, wraps the DEK via ECIES (X25519 + HKDF-SHA256 + XChaCha20-Poly1305), encrypts the plaintext with XChaCha20-Poly1305 using the manifest's content hash as AAD, and assembles the container. Raises ValueError if any precondition is violated.
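The preconditions listed in the parameter table can be sketched as a standalone check. This is an illustration only, not the package's code: check_seal_preconditions is a hypothetical helper, and the manifest fields are passed as plain arguments instead of a Manifest instance.

```python
import hashlib


def check_seal_preconditions(
    plaintext: bytes,
    content_hash: str,
    size_bytes: int,
    manifest_recipient_pub_hex: str,
    recipient_x25519_pub: bytes,
) -> None:
    """Hypothetical helper: raise ValueError if a documented precondition fails."""
    # content_hash must equal the SHA-256 hex digest of the plaintext.
    if hashlib.sha256(plaintext).hexdigest() != content_hash:
        raise ValueError("content_hash does not match sha256(plaintext)")
    # size_bytes must equal the plaintext length.
    if size_bytes != len(plaintext):
        raise ValueError("size_bytes does not match len(plaintext)")
    # The key passed to seal() must be 32 bytes and match the manifest's recipient.
    if len(recipient_x25519_pub) != 32:
        raise ValueError("recipient_x25519_pub must be 32 bytes")
    if recipient_x25519_pub.hex() != manifest_recipient_pub_hex.lower():
        raise ValueError("recipient key does not match manifest.recipient.x25519_pub")
```

Running a check like this before calling seal() gives a clearer error message than a failure surfacing later in open_sealed().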

open_sealed()

| Parameter | Type | Description |
| --- | --- | --- |
| blob | bytes | The .sealed file contents. |
| recipient_x25519_priv | bytes (32) | Recipient's X25519 private key for DEK unwrapping. |
| trusted_issuer_pubs | Optional[set[str]] | If provided, the issuer's Ed25519 public key (hex) must be in this set. Rejects unknown issuers. |
| policy_ctx | Optional[PolicyContext] | Runtime context for policy enforcement (clock, IP, open counter state). |

Returns: tuple[bytes, Manifest] containing the decrypted plaintext and the parsed, verified manifest.

Verification order is: parse container, verify Ed25519 signature, check trusted-issuer list, enforce time/jurisdiction/max_opens policy, unwrap DEK, AEAD decrypt, and post-decrypt SHA-256 content hash check. Raises ValueError on any integrity failure and PolicyViolation if policy constraints are not met. For multi-recipient containers, each wrapped DEK slot is tried in turn until one succeeds.
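The policy step can be pictured with a small sketch. Everything here is an assumption made for illustration: enforce_policy is a hypothetical function, PolicyViolation is a local stand-in for the exception named above, not_after is treated as a Unix timestamp in seconds, and the open counter is supplied by the caller. Jurisdiction checks are omitted because this reference does not describe their semantics.

```python
import time
from typing import Optional


class PolicyViolation(Exception):
    """Local stand-in for the exception named in this reference."""


def enforce_policy(policy: dict, opens_so_far: int, now: Optional[int] = None) -> None:
    """Hypothetical check of the not_after and max_opens policy keys."""
    now = int(time.time()) if now is None else now
    # Assumption: not_after is a Unix timestamp in seconds.
    not_after = policy.get("not_after")
    if not_after is not None and now > not_after:
        raise PolicyViolation("manifest expired (not_after)")
    # Assumption: the caller tracks how many times the file was opened.
    max_opens = policy.get("max_opens")
    if max_opens is not None and opens_so_far >= max_opens:
        raise PolicyViolation("max_opens exhausted")
```

Running the policy check before DEK unwrapping, as the verification order specifies, means a rejected open never touches key material.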

SealedFile

A dataclass representing the parsed binary container. Fields: manifest (Manifest), wrapped_dek (dict), aead_nonce (bytes, 24), ciphertext (bytes), suite_id (int, default 1 for CLASSIC_V1). Provides to_bytes() for serialization and from_bytes(data) for parsing with length-field validation against DoS caps (4 MB manifest, 1 MB wrapped DEK, 4 GB ciphertext).
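Length-field validation against a cap can be sketched as below. The on-disk layout is not specified in this reference, so the 4-byte big-endian length prefix, the read_capped_field name, and the reading of "4 MB" as 4 MiB are all assumptions made for the example.

```python
import struct

# Cap taken from the text above; treating MB as MiB is an assumption.
MAX_MANIFEST = 4 * 1024 * 1024


def read_capped_field(data: bytes, offset: int, cap: int) -> tuple[bytes, int]:
    """Read one length-prefixed field, rejecting declared lengths above cap.

    Assumes a 4-byte big-endian length prefix (hypothetical layout).
    Returns the field bytes and the offset of the next field.
    """
    if offset + 4 > len(data):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">I", data, offset)
    # Check the cap before slicing so a hostile length is rejected early.
    if length > cap:
        raise ValueError("declared length exceeds cap")
    start = offset + 4
    end = start + length
    if end > len(data):
        raise ValueError("declared length exceeds remaining data")
    return data[start:end], end
```

Checking the declared length before any allocation or slicing is what keeps a malformed container from forcing large memory use.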

watermark

The watermark module handles L1 (zero-width Unicode) and L2 (trailing whitespace) watermark embedding and extraction, as well as high-level multi-layer application, recovery, and fusion. L3 semantic marks are delegated to the semantic module but accessible through apply_all() and verify_l3() here.

new_mark_id()

| Parameter | Type | Description |
| --- | --- | --- |
| n_bytes | int (default 8) | Length of the mark ID in bytes. 8 bytes (64 bits) is the default. |

Returns: bytes, a cryptographically random per-recipient mark identifier.

embed_zw()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text to watermark. |
| mark_id | bytes | Per-recipient mark identifier. |
| density | int (default 40) | Approximate character interval between redundant mark frames. Lower values insert more copies. |

Returns: str with zero-width Unicode frames (ZWSP/ZWNJ/ZWJ) inserted at regular intervals.

extract_zw()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Potentially watermarked text. |
| mark_len_bytes | int (default 8) | Expected mark ID length in bytes. |

Returns: list[bytes], all recovered mark IDs from zero-width frames. May contain duplicates if multiple frames survived.
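A toy version of zero-width framing shows why several copies of the same mark can come back from one text. The frame layout is an assumption made for this sketch (ZWJ as frame delimiter, ZWSP for bit 0, ZWNJ for bit 1, most significant bit first); the real embed_zw() and extract_zw() may encode frames differently.

```python
import re

ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"


def make_frame(mark_id: bytes) -> str:
    """Encode mark_id as ZWJ + payload bits + ZWJ (hypothetical frame layout)."""
    bits = "".join(f"{byte:08b}" for byte in mark_id)
    return ZWJ + "".join(ZWNJ if bit == "1" else ZWSP for bit in bits) + ZWJ


def embed_zw_sketch(text: str, mark_id: bytes, density: int = 40) -> str:
    """Insert one frame after every `density` characters of text."""
    frame = make_frame(mark_id)
    chunks = []
    for i in range(0, len(text), density):
        chunks.append(text[i:i + density])
        chunks.append(frame)
    return "".join(chunks)


def extract_zw_sketch(text: str, mark_len_bytes: int = 8) -> list[bytes]:
    """Return every complete frame found, duplicates included."""
    n_bits = mark_len_bytes * 8
    pattern = re.compile(ZWJ + "([" + ZWSP + ZWNJ + "]{" + str(n_bits) + "})" + ZWJ)
    marks = []
    for payload in pattern.findall(text):
        bits = "".join("1" if ch == ZWNJ else "0" for ch in payload)
        marks.append(int(bits, 2).to_bytes(mark_len_bytes, "big"))
    return marks
```

Because each frame is self-delimiting, a leak that keeps only part of the text can still yield a full mark ID as long as one frame survives intact.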

embed_ws() / extract_ws()

L2 trailing-whitespace encoding. embed_ws(text, mark_id) appends a trailing space (bit 0) or tab (bit 1) to lines that have no existing trailing whitespace. extract_ws(text, mark_len_bytes) reads the mark back, returning Optional[bytes] (None if insufficient lines).
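The space/tab scheme described above can be sketched in a few lines. The bit order (most significant bit first) and the function names are assumptions for this illustration. The extractor here reads every line that ends in a space or tab, so it assumes the input had no trailing whitespace before embedding.

```python
from typing import Optional


def embed_ws_sketch(text: str, mark_id: bytes) -> str:
    """Append space (bit 0) or tab (bit 1) to lines without trailing whitespace."""
    bits = [(byte >> (7 - i)) & 1 for byte in mark_id for i in range(8)]
    out, pos = [], 0
    for line in text.split("\n"):
        has_trailing_ws = line != line.rstrip(" \t")
        if not has_trailing_ws and pos < len(bits):
            line += "\t" if bits[pos] else " "
            pos += 1
        out.append(line)
    return "\n".join(out)


def extract_ws_sketch(text: str, mark_len_bytes: int = 8) -> Optional[bytes]:
    """Read bits back from line endings; None if too few marked lines."""
    bits = []
    for line in text.split("\n"):
        if line.endswith("\t"):
            bits.append(1)
        elif line.endswith(" "):
            bits.append(0)
    needed = mark_len_bytes * 8
    if len(bits) < needed:
        return None
    value = 0
    for bit in bits[:needed]:
        value = (value << 1) | bit
    return value.to_bytes(mark_len_bytes, "big")
```

An 8-byte mark needs 64 usable lines under this scheme, which is why short documents return None.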

extract_ws_partial()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Text with potential L2 marks. |
| mark_len_bytes | int (default 8) | Expected mark ID length. |

Returns: tuple[Optional[bytes], float, int, int] containing (best_candidate, confidence, bits_recovered, bits_needed). Confidence is the ratio of recovered bits to needed bits. Unknown bits are zero-padded. Partial candidates with confidence ≥ 0.5 are included in fusion scoring.

apply_all()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |
| mark_id | bytes | Per-recipient mark identifier (shared across all layers). |

Returns: str with all available layers applied. Layer ordering is L3 (semantic) first, then L2 (whitespace), then L1 (zero-width). This ordering prevents L1's invisible characters from fragmenting L3's synonym words.

recover_marks()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Leaked text to analyze. |
| mark_len_bytes | int (default 8) | Expected mark ID length. |

Returns: dict with keys L1_zero_width, L2_whitespace, L3_synonyms, each containing a list of candidate mark bytes. The L3_synonyms list is always empty here because L3 cannot be extracted blindly; it requires candidate-based verification (see verify_l3).

recover_marks_v2()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Leaked text. |
| candidate_mark_ids | Optional[list[bytes]] | Known mark IDs from registry for L3 verification. If None, L3 is skipped. |
| mark_len_bytes | int (default 8) | Expected mark ID length. |

Returns: dict with keys layers (per-layer results and confidence), candidates (fused ranked list of (mark_id, combined_score, evidence_summary) tuples), and diagnostics (human-readable status strings per layer). Fusion uses independence-assumption score combination: combined = 1 - (1-s1)(1-s2)...(1-sN).
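The fusion formula is a noisy-OR over per-layer scores and is easy to check by hand. The sketch below assumes each score is already in [0, 1]; fuse_scores is a hypothetical name, not a function of the package.

```python
from math import prod


def fuse_scores(scores: list[float]) -> float:
    """Noisy-OR combination: 1 - (1-s1)(1-s2)...(1-sN)."""
    return 1.0 - prod(1.0 - s for s in scores)


# Two layers that each give moderate evidence reinforce each other:
# 1 - (1 - 0.5) * (1 - 0.5) = 0.75
combined = fuse_scores([0.5, 0.5])
```

Under this rule a single layer with score 1.0 forces the combined score to 1.0, and adding a layer can never lower the result, which is the behavior expected when layers are treated as independent evidence.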

verify_l3()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Text to test. |
| candidate_mark_ids | list[bytes] | Mark IDs to test against semantic marks in the text. |
| threshold | float (default 0.70) | Minimum weighted score for a match. |

Returns: list[tuple[bytes, float, dict]] of (mark_id, score, detail_dict) for candidates above the threshold, sorted by score descending. Delegates to semantic.verify_semantic().

semantic

The semantic module implements L3 watermarking: synonym-class rotation, punctuation-style fingerprinting, spelling variants, contraction choices, and number formatting. These marks survive format conversion, invisible-character stripping, and OCR because the signal is encoded in the words and punctuation themselves, not in formatting metadata.

apply_semantic()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |
| mark_id | bytes | Per-recipient mark identifier. |
| use_v2 | bool (default True) | Use the expanded 151-class v2 synonym dictionary. Falls back to the 27-class v1 table if False or if the v2 module is unavailable. |

Returns: str with all L3 sublayers applied (synonyms, punctuation, spelling, contractions, number formatting). This is the primary L3 embedding entry point.

verify_semantic()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Text to verify. |
| candidate_mark_id | bytes | The mark ID to test. |
| use_v2 | bool (default True) | Use v2 dictionary for synonym verification. |

Returns: dict with per-sublayer scores (synonyms_score, punctuation_score, spelling_score, contraction_score), per-sublayer hit counts (e.g., "2/3"), a combined weighted_score (weights: synonyms 0.50, punctuation 0.10, spelling 0.20, contractions 0.20), and an overall_match boolean (True if weighted_score ≥ 0.65).
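The weighting can be worked through with a short sketch. The weights and the 0.65 threshold come from the description above; the function name and the choice to treat a missing sublayer score as 0.0 are assumptions.

```python
# Weights and threshold as stated in this reference.
WEIGHTS = {
    "synonyms_score": 0.50,
    "punctuation_score": 0.10,
    "spelling_score": 0.20,
    "contraction_score": 0.20,
}
MATCH_THRESHOLD = 0.65


def weighted_score_sketch(scores: dict[str, float]) -> float:
    """Weighted sum of sublayer scores; missing sublayers count as 0.0 (assumption)."""
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())


example = {
    "synonyms_score": 0.8,
    "punctuation_score": 1.0,
    "spelling_score": 0.5,
    "contraction_score": 0.5,
}
# 0.50*0.8 + 0.10*1.0 + 0.20*0.5 + 0.20*0.5 = 0.70, which clears the threshold.
overall_match = weighted_score_sketch(example) >= MATCH_THRESHOLD
```

Because synonyms carry half of the weight, a text with no synonym hits cannot reach 0.65 even when the other three sublayers match perfectly.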

embed_synonyms_v2()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |
| mark_id | bytes | Mark identifier. |
| min_instances | int (default 8) | Minimum synonym-class hits required. If the text has fewer, no embedding occurs and the text is returned unchanged. |

Returns: str with synonym words replaced according to a deterministic variant sequence derived from mark_id via SHA-256 expansion. Skips URLs, email addresses, file paths, code blocks, hex strings, and base64 content. Falls back to the v1 27-class table if the v2 dictionary is unavailable.
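One way to picture "SHA-256 expansion" is a counter-mode hash stream reduced to variant indices. The construction below is an assumption chosen for illustration (the counter encoding, the byte-to-index mapping, and the variant_sequence name are all hypothetical), not the package's actual derivation.

```python
import hashlib


def variant_sequence(mark_id: bytes, n: int, class_size: int = 3) -> list[int]:
    """Derive n variant indices in range(class_size) from mark_id.

    Hypothetical scheme: hash mark_id with an incrementing 4-byte counter
    and map each output byte to an index with a modulo.
    """
    indices: list[int] = []
    counter = 0
    while len(indices) < n:
        block = hashlib.sha256(mark_id + counter.to_bytes(4, "big")).digest()
        # The modulo introduces a slight bias when 256 is not a multiple of
        # class_size; acceptable for a sketch.
        indices.extend(byte % class_size for byte in block)
        counter += 1
    return indices[:n]
```

The important property is determinism: the verifier can regenerate the same sequence from a candidate mark_id and compare it against the variants observed in a leaked text.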

embed_spelling()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |
| mark_id | bytes | Mark identifier. Bits 8-32 (offset from punctuation bits) select American vs British spelling for each of 25 variant pairs. |

Returns: str with spelling variants applied (e.g., "color"/"colour", "organize"/"organise"). Case-preserving substitution.

embed_contractions()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |
| mark_id | bytes | Mark identifier. Bits 40+ select contracted vs expanded form for each of 30 contraction pairs. |

Returns: str with contractions expanded or collapsed per the mark_id (e.g., "don't"/"do not", "it's"/"it is").

embed_punctuation() / extract_punctuation_bits()

embed_punctuation(text, mark_id) applies three deterministic punctuation choices: bit 0 controls the Oxford comma, bit 1 selects em dash vs double-hyphen, and bit 2 selects curly vs straight quotes. Idempotent. extract_punctuation_bits(text) returns list[int] with the detected bit values (up to 3 bits), based on which style dominates in the text.

embed_number_format()

embed_number_format(text, mark_id) applies two number formatting choices: bit 72 controls comma separators in numbers ≥ 1000 ("1,000" vs "1000"), and bit 73 controls percent symbol vs word form ("50%" vs "50 percent").

fingerprint

The fingerprint module provides server-side content identification for leak detection when all embedded watermarks have been stripped. Fingerprints are computed at seal time and stored alongside the manifest or in the registry. They never appear in the document itself, so an adversary cannot strip what is not embedded.

ContentFingerprint

The primary class combining winnowing and sentence fingerprints for a document.

| Attribute | Type | Description |
| --- | --- | --- |
| winnowing_fp | list[int] | Sorted list of selected winnowing hash values. |
| sentence_fp | list[str] | List of 16-char hex hashes, one per sentence (order-independent within each sentence). |
| text_length | int | Original text length in characters. |
| sentence_count | int | Number of detected sentences. |

ContentFingerprint.from_text()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Source text. |
| k | int (default 10) | K-gram size for winnowing (character-level). |
| window | int (default 4) | Winnowing window size. |

Returns: ContentFingerprint instance.

ContentFingerprint.similarity()

| Parameter | Type | Description |
| --- | --- | --- |
| other | ContentFingerprint | The fingerprint to compare against. |

Returns: dict with keys winnowing (Jaccard similarity), sentence (set-overlap fraction), combined (0.4 * winnowing + 0.6 * sentence), and verdict (one of MATCH if ≥ 0.6, LIKELY if ≥ 0.3, UNLIKELY if ≥ 0.1, or NO_MATCH).
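The combination rule and verdict thresholds can be written out directly. The function name is hypothetical; the weights and cut-offs are the ones listed above.

```python
def combine_similarity(winnowing: float, sentence: float) -> dict:
    """Combine two similarity scores and map the result to a verdict."""
    combined = 0.4 * winnowing + 0.6 * sentence
    if combined >= 0.6:
        verdict = "MATCH"
    elif combined >= 0.3:
        verdict = "LIKELY"
    elif combined >= 0.1:
        verdict = "UNLIKELY"
    else:
        verdict = "NO_MATCH"
    return {
        "winnowing": winnowing,
        "sentence": sentence,
        "combined": combined,
        "verdict": verdict,
    }
```

For example, a leak that preserves every sentence but is heavily reformatted (sentence 1.0, winnowing 0.0) still scores 0.6 and is reported as MATCH, because the sentence component carries the larger weight.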

ContentFingerprint.to_dict() / from_dict()

Serialization and deserialization for storage in the manifest or registry. to_dict() returns a plain dict; from_dict(d) is a classmethod that reconstructs the fingerprint from a stored dict.

winnow()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text (will be normalized: lowercased, whitespace collapsed, non-alphanumeric stripped). |
| k | int (default 10) | K-gram size. |
| window | int (default 4) | Winnowing window size. Smaller windows produce more hashes (higher recall, lower precision). |

Returns: list[int], sorted list of selected hash values. Each k-gram is hashed with MD5 and the digest is truncated to 32 bits.
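A compact winnowing sketch follows. It takes the minimum hash in each sliding window of k-gram hashes, which is the standard winnowing selection rule. The normalization here removes every character outside a-z and 0-9 (whitespace included), and duplicate selections are collapsed into a set; both choices are assumptions and may differ from winnow().

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything that is not a-z or 0-9 (assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash each character k-gram with MD5, keeping the first 32 bits."""
    return [
        int.from_bytes(
            hashlib.md5(text[i:i + k].encode("utf-8"), usedforsecurity=False).digest()[:4],
            "big",
        )
        for i in range(len(text) - k + 1)
    ]


def winnow_sketch(text: str, k: int = 10, window: int = 4) -> list[int]:
    """Select the minimum hash from each window; return the sorted unique set."""
    hashes = kgram_hashes(normalize(text), k)
    if not hashes:
        return []
    selected = set()
    # max(1, ...) handles inputs with fewer hashes than one full window.
    for i in range(max(1, len(hashes) - window + 1)):
        selected.add(min(hashes[i:i + window]))
    return sorted(selected)
```

Because selection depends only on local windows, two documents that share a long enough passage select the same hashes for that passage, regardless of what surrounds it.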

sentence_fingerprint()

| Parameter | Type | Description |
| --- | --- | --- |
| text | str | Input text. |

Returns: list[str] of 16-char hex hashes (SHA-256, truncated). Each hash represents a sentence's sorted content words (words with > 2 characters). Sentences with fewer than 3 words are skipped.
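The sentence hashing described above can be sketched as follows. The sentence splitter, the tokenizer, and the decision to apply the three-word minimum to all words (not only content words) are assumptions made for this example.

```python
import hashlib
import re


def sentence_fingerprint_sketch(text: str) -> list[str]:
    """Hash the sorted content words of each sentence (SHA-256, first 16 hex chars)."""
    hashes = []
    # Assumption: sentences end at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z0-9']+", sentence.lower())
        if len(words) < 3:
            continue  # skip very short sentences
        # Content words are words with more than 2 characters; sorting makes
        # the hash independent of word order within the sentence.
        content = sorted(w for w in words if len(w) > 2)
        digest = hashlib.sha256(" ".join(content).encode("utf-8")).hexdigest()
        hashes.append(digest[:16])
    return hashes
```

Sorting the words before hashing is what lets a reordered sentence still match its original.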

winnow_similarity() / sentence_similarity()

winnow_similarity(fp1, fp2) computes Jaccard similarity between two winnowing fingerprints (intersection over union of hash sets). sentence_similarity(fp1, fp2) computes the fraction of hashes in fp2 that appear in fp1.
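Both measures are short enough to sketch directly. The function names carry a _sketch suffix to mark them as illustrations, and returning 0.0 for empty inputs is an assumption.

```python
def winnow_similarity_sketch(fp1: list[int], fp2: list[int]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two hash sets."""
    a, b = set(fp1), set(fp2)
    if not a and not b:
        return 0.0  # assumption: two empty fingerprints do not match
    return len(a & b) / len(a | b)


def sentence_similarity_sketch(fp1: list[str], fp2: list[str]) -> float:
    """Fraction of hashes in fp2 that also appear in fp1."""
    if not fp2:
        return 0.0  # assumption: nothing to compare
    known = set(fp1)
    return sum(1 for h in fp2 if h in known) / len(fp2)
```

Note that the sentence measure is asymmetric: swapping fp1 and fp2 changes the denominator, so a short excerpt compared against a long original scores high when the excerpt is passed as fp2.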

ecc

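The ecc module provides error-correcting codes for watermark bit protection. It implements a repetition code with majority-vote decoding: each payload bit is repeated R times, and decoding recovers the original bit by majority vote over each group. With R=7 (the default), up to 3 flipped bits per group are corrected, so a group survives as long as no more than 3/7 (about 43%) of its bits are wrong. Under independent random bit errors at rate p, a group decodes incorrectly with probability P(X ≥ 4) for X ~ Binomial(7, p), which is roughly 29% at p = 0.4, so reliable recovery requires a random bit error rate well below that per-group limit.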

encode()

| Parameter | Type | Description |
| --- | --- | --- |
| payload | bytes | Raw bytes to protect (typically 8-byte mark_id). |
| repetitions | int (default 7) | Odd number of times each bit is repeated. Higher values increase error tolerance at the cost of bandwidth. |

Returns: list[int] of coded bits. Length = len(payload) * 8 * repetitions.

decode()

| Parameter | Type | Description |
| --- | --- | --- |
| coded_bits | list[int] | Received bits (may contain errors). Padded or truncated to expected length. |
| payload_len | int (default 8) | Expected payload length in bytes. |
| repetitions | int (default 7) | Repetition factor used during encoding. |

Returns: tuple[bytes, float, int] containing (recovered_payload, confidence, errors_corrected). Confidence is the fraction of groups where the majority vote was unanimous (all bits agreed).
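A repetition code with majority-vote decoding fits in a few lines. The sketch below follows the return shapes documented above; the bit order (most significant bit first), zero-padding of short inputs, and the function names are assumptions.

```python
def encode_sketch(payload: bytes, repetitions: int = 7) -> list[int]:
    """Repeat every payload bit `repetitions` times, MSB first."""
    bits = [(byte >> (7 - i)) & 1 for byte in payload for i in range(8)]
    return [bit for bit in bits for _ in range(repetitions)]


def decode_sketch(
    coded_bits: list[int], payload_len: int = 8, repetitions: int = 7
) -> tuple[bytes, float, int]:
    """Majority-vote decode; returns (payload, confidence, errors_corrected)."""
    n_groups = payload_len * 8
    expected = n_groups * repetitions
    # Pad with zeros or truncate to the expected length (padding value assumed).
    coded = (list(coded_bits) + [0] * expected)[:expected]
    value, unanimous, corrected = 0, 0, 0
    for g in range(n_groups):
        group = coded[g * repetitions:(g + 1) * repetitions]
        ones = sum(group)
        bit = 1 if ones * 2 > repetitions else 0
        minority = repetitions - ones if bit else ones
        corrected += minority          # bits that disagreed with the vote
        unanimous += minority == 0     # groups where every bit agreed
        value = (value << 1) | bit
    return value.to_bytes(payload_len, "big"), unanimous / n_groups, corrected
```

With R=7, flipping any 3 bits in every group still decodes to the original payload, but the reported confidence drops to 0.0 because no group is unanimous.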

verify_with_ecc()

| Parameter | Type | Description |
| --- | --- | --- |
| observed_variant_indices | list[int] | Synonym variant indices observed in the text. |
| candidate_mark_id | bytes | The mark ID to verify. |
| class_size | int (default 3) | Number of variants per synonym class. |
| repetitions | int (default 3) | ECC repetition factor. |

Returns: tuple[bool, float, bytes] containing (match, confidence, decoded_mark_id). Compares the expected variant sequence against observed choices, then either decodes via ECC (if enough bits are available) or falls back to simple ratio matching with a 0.70 threshold.

manifest

The manifest module defines the signed metadata that binds a sealed file to its recipient, watermarks, beacons, and policy. The manifest is serialized as canonical JSON (sorted keys, no optional whitespace, null fields omitted) and signed with Ed25519.

Manifest

A dataclass with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| file_id | str | UUID4 identifier for the sealed file. |
| issued_at | int | Unix timestamp (seconds) when the manifest was created. |
| version | str | Protocol version string, e.g., "OVERSIGHT-v1". |
| suite | str | Algorithm suite identifier ("OSGT-CLASSIC-v1" or "OSGT-HYBRID-v1"). |
| original_filename | str | Original file name at seal time. |
| content_hash | str | SHA-256 hex digest of the plaintext. |
| content_type | str | MIME type (default "application/octet-stream"). |
| size_bytes | int | Plaintext size in bytes. |
| issuer_id | str | Stable identifier for the issuer. |
| issuer_ed25519_pub | str | Issuer's Ed25519 public key (hex). |
| recipient | Optional[Recipient] | Recipient binding. |
| watermarks | list[WatermarkRef] | Per-recipient watermark references. |
| beacons | list[dict] | Beacon token descriptors. |
| policy | dict | Policy constraints: not_after, max_opens, jurisdiction, registry_url. |
| canonical_content_hash | str | Added in v0.4.5. SHA-256 hex digest of the pre-watermark source bytes. Provides a dispute anchor when L3 produces a recipient copy that is textually non-identical to the canonical source. |
| l3_policy | dict | Added in v0.4.5. Records the L3 safety decision at seal time: mode (off, full, or boilerplate), document_class, and ack (whether the non-identity acknowledgement was given). |
| signature_ed25519 | str | Ed25519 signature over canonical bytes (hex). Filled by sign(). |
| signature_ml_dsa | str | Reserved for post-quantum ML-DSA-65 signature. |

Manifest.new()

Class method that constructs a new Manifest with a fresh UUID4 file_id and current timestamp. Accepts original_filename, content_hash, size_bytes, issuer_id, issuer_ed25519_pub_hex, recipient (Recipient), registry_url, and optional content_type, not_after, max_opens, jurisdiction.

Manifest.sign() / Manifest.verify()

sign(issuer_ed25519_priv: bytes) computes the canonical bytes (excluding signature fields), signs with Ed25519, and stores the hex signature in signature_ed25519. verify() -> bool checks the stored signature against the issuer's public key and the canonical bytes.

Manifest.canonical_bytes() / to_json() / from_json()

canonical_bytes() returns the UTF-8 canonical JSON used as the signing input (signature fields set to empty string, null values stripped, keys sorted). to_json() returns the full manifest including signatures. from_json(data: bytes) deserializes from JSON bytes, reconstructing nested Recipient and WatermarkRef objects.
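The canonicalization rules (sorted keys, no optional whitespace, nulls omitted, signature fields blanked) map closely onto the standard json module. The sketch below works on a plain dict and is only an approximation: the helper names are hypothetical, and details such as ASCII escaping of non-ASCII characters are assumptions.

```python
import json

SIGNATURE_FIELDS = ("signature_ed25519", "signature_ml_dsa")


def strip_nulls(value):
    """Recursively drop dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value


def canonical_bytes_sketch(manifest: dict) -> bytes:
    """Blank the signature fields, drop nulls, and emit compact sorted JSON."""
    body = dict(manifest)
    for name in SIGNATURE_FIELDS:
        body[name] = ""
    text = json.dumps(strip_nulls(body), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")
```

Blanking the signature fields before serializing is what makes signing and verification operate on identical bytes: the signing input does not change once the signature is stored.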

Recipient

| Field | Type | Description |
| --- | --- | --- |
| recipient_id | str | Stable identifier (email hash, user UUID, etc.). |
| x25519_pub | str | Recipient's X25519 public key (hex). |
| ed25519_pub | Optional[str] | Recipient's Ed25519 public key (hex), for verifying recipient acknowledgments. |

WatermarkRef

| Field | Type | Description |
| --- | --- | --- |
| layer | str | Layer identifier: "L1_zero_width", "L2_whitespace", or "L3_semantic". |
| mark_id | str | Hex-encoded per-recipient mark identifier. |

crypto

The crypto module wraps vetted cryptographic primitives with no custom constructions. It uses the cryptography library (OpenSSL backend) for classical operations and oqs-python for post-quantum hooks (ML-KEM-768, ML-DSA-65) when available.

ClassicIdentity

A dataclass holding an X25519 keypair (encryption) and an Ed25519 keypair (signing), each as 32-byte raw values. ClassicIdentity.generate() creates a new random identity. public_bundle() returns a dict with hex-encoded public keys suitable for distribution to issuers.

content_hash()

| Parameter | Type | Description |
| --- | --- | --- |
| data | bytes | Raw content to hash. |

Returns: str, the SHA-256 hex digest. Used for the manifest's content_hash field and as AEAD additional data.

random_dek()

Returns: bytes (32), a cryptographically random document encryption key for XChaCha20-Poly1305.

aead_encrypt() / aead_decrypt()

aead_encrypt(key, plaintext, aad) encrypts with XChaCha20-Poly1305 using a random 24-byte nonce, returning tuple[bytes, bytes] (nonce, ciphertext with tag). aead_decrypt(key, nonce, ciphertext, aad) decrypts and verifies the authentication tag. Raises an exception on tag mismatch.

wrap_dek_for_recipient() / unwrap_dek()

ECIES-style DEK wrapping. wrap_dek_for_recipient(dek, recipient_x25519_pub) generates an ephemeral X25519 keypair, performs key agreement, derives a wrapping key via HKDF-SHA256 with info string b"oversight-v1-dek-wrap", and encrypts the DEK with XChaCha20-Poly1305. Returns a dict with hex-encoded ephemeral_pub, nonce, and wrapped_dek. unwrap_dek(wrapped, recipient_x25519_priv) reverses the operation.
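The key-derivation step can be illustrated with a standard-library implementation of HKDF-SHA256 (RFC 5869). Only the info string is taken from this reference. The salt handling, the 32-byte output length, and the use of the raw shared secret as input keying material are assumptions, and the all-zero shared secret below is a placeholder for the X25519 output.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256, using only the standard library."""
    if length > 255 * 32:
        raise ValueError("requested output too long for HKDF-SHA256")
    if not salt:
        salt = b"\x00" * 32  # RFC 5869 default: HashLen zero bytes
    # Extract: condense the input keying material into a pseudorandom key.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    # Expand: chain HMAC blocks until enough output is produced.
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


WRAP_INFO = b"oversight-v1-dek-wrap"
shared_secret = bytes(32)  # placeholder for the X25519 key-agreement output
wrapping_key = hkdf_sha256(shared_secret, WRAP_INFO)
```

The info string provides domain separation: the same shared secret run through HKDF with a different info value, such as the hybrid wrapping label, yields an unrelated key.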

sign_manifest() / verify_manifest()

sign_manifest(manifest_bytes, ed25519_priv) returns the Ed25519 signature (bytes). verify_manifest(manifest_bytes, signature, ed25519_pub) returns bool.

Post-Quantum Functions

Available when oqs-python is installed (PQ_AVAILABLE = True): pq_kem_keypair(), pq_kem_encap(), pq_kem_decap() for ML-KEM-768, and pq_sig_keypair(), pq_sign(), pq_verify() for ML-DSA-65. hybrid_wrap_dek() combines X25519 and ML-KEM-768 shared secrets (X-wing-style) via HKDF with info string b"oversight-hybrid-v1-dek-wrap".


This reference documents the Python API as of v0.4.5, including the canonical_content_hash and l3_policy manifest fields and the l3_policy module added in that release. The Rust implementation provides equivalent functionality with native types. Consult the repository for the latest interfaces.