API Reference

Zion Boggan · April 2026 · Oversight Protocol v0.4.5

The oversight_core package exposes the full seal/open pipeline, watermark embedding and recovery, semantic marking, content fingerprinting, error-correcting codes, manifest construction, and cryptographic primitives. All public symbols are importable from the top-level package or from their respective submodules. This reference covers the Python implementation; the Rust API mirrors these interfaces but uses native types and ownership semantics.

container (seal, open_sealed) · watermark (L1/L2 marks, fusion) · semantic (L3 marks) · fingerprint (content identification) · ecc (error correction) · manifest (metadata binding) · crypto (primitives)

container

The container module implements the binary .sealed format and exposes the two primary entry points for the protocol: seal() and open_sealed(). It also provides the SealedFile dataclass for low-level access to the container fields.

seal()

Parameter	Type	Description
`plaintext`	`bytes`	Raw content to encrypt. Watermarking, if desired, must be applied before calling seal.
`manifest`	`Manifest`	Pre-populated manifest. `content_hash` must equal `sha256(plaintext)` and `size_bytes` must equal `len(plaintext)`.
`issuer_ed25519_priv`	`bytes` (32)	Issuer's Ed25519 private key seed for signing the manifest.
`recipient_x25519_pub`	`bytes` (32)	Recipient's X25519 public key. Must match `manifest.recipient.x25519_pub`.

Returns: bytes containing the complete .sealed binary blob.

The function signs the manifest with Ed25519, generates a random 256-bit DEK, wraps the DEK via ECIES (X25519 + HKDF-SHA256 + XChaCha20-Poly1305), encrypts the plaintext with XChaCha20-Poly1305 using the manifest's content hash as AAD, and assembles the container. Raises ValueError if any precondition is violated.
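The two manifest preconditions listed in the parameter table can be checked by the caller before invoking seal(). The following is a minimal sketch using only the standard library; check_seal_preconditions is a hypothetical helper, and the manifest is modelled as a plain dict rather than the real Manifest class.

```python
import hashlib


def check_seal_preconditions(plaintext: bytes, manifest: dict) -> None:
    """Hypothetical helper: raise ValueError if the manifest does not describe plaintext."""
    expected_hash = hashlib.sha256(plaintext).hexdigest()
    if manifest.get("content_hash") != expected_hash:
        raise ValueError("content_hash does not match sha256(plaintext)")
    if manifest.get("size_bytes") != len(plaintext):
        raise ValueError("size_bytes does not match len(plaintext)")


plaintext = b"quarterly report"
# Stand-in for a populated Manifest; only the two checked fields are shown.
manifest = {
    "content_hash": hashlib.sha256(plaintext).hexdigest(),
    "size_bytes": len(plaintext),
}
check_seal_preconditions(plaintext, manifest)  # passes silently
```

Because watermarking changes the bytes, the hash and size must be computed over the watermarked plaintext that is actually passed to seal().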

open_sealed()

Parameter	Type	Description
`blob`	`bytes`	The `.sealed` file contents.
`recipient_x25519_priv`	`bytes` (32)	Recipient's X25519 private key for DEK unwrapping.
`trusted_issuer_pubs`	`Optional[set[str]]`	If provided, the issuer's Ed25519 public key (hex) must be in this set. Rejects unknown issuers.
`policy_ctx`	`Optional[PolicyContext]`	Runtime context for policy enforcement (clock, IP, open counter state).

Returns: tuple[bytes, Manifest] containing the decrypted plaintext and the parsed, verified manifest.

Verification proceeds in this order: parse the container, verify the Ed25519 signature, check the trusted-issuer list, enforce the time, jurisdiction, and max_opens policy, unwrap the DEK, AEAD-decrypt, and check the SHA-256 content hash of the decrypted plaintext. Raises ValueError on any integrity failure and PolicyViolation if policy constraints are not met. For multi-recipient containers, each wrapped DEK slot is tried in turn until one succeeds.

SealedFile

A dataclass representing the parsed binary container. Fields: manifest (Manifest), wrapped_dek (dict), aead_nonce (bytes, 24), ciphertext (bytes), suite_id (int, default 1 for CLASSIC_V1). Provides to_bytes() for serialization and from_bytes(data) for parsing with length-field validation against DoS caps (4 MB manifest, 1 MB wrapped DEK, 4 GB ciphertext).

watermark

The watermark module handles L1 (zero-width Unicode) and L2 (trailing whitespace) watermark embedding and extraction, as well as high-level multi-layer application, recovery, and fusion. L3 semantic marks are delegated to the semantic module but accessible through apply_all() and verify_l3() here.

new_mark_id()

Parameter	Type	Description
`n_bytes`	`int` (default 8)	Length of the mark ID in bytes. 8 bytes (64 bits) is the default.

Returns: bytes, a cryptographically random per-recipient mark identifier.

embed_zw()

Parameter	Type	Description
`text`	`str`	Input text to watermark.
`mark_id`	`bytes`	Per-recipient mark identifier.
`density`	`int` (default 40)	Approximate character interval between redundant mark frames. Lower values insert more copies.

Returns: str with zero-width Unicode frames (ZWSP/ZWNJ/ZWJ) inserted at regular intervals.

extract_zw()

Parameter	Type	Description
`text`	`str`	Potentially watermarked text.
`mark_len_bytes`	`int` (default 8)	Expected mark ID length in bytes.

Returns: list[bytes], all recovered mark IDs from zero-width frames. May contain duplicates if multiple frames survived.
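To illustrate how redundant zero-width frames can be embedded and recovered, here is a minimal sketch. The frame layout is an assumption made for this example (ZWJ as a frame delimiter, ZWSP for bit 0, ZWNJ for bit 1, most significant bit first); the actual encoding used by embed_zw() and extract_zw() may differ. The function names carry a _sketch suffix to mark them as hypothetical.

```python
import re

ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"


def make_frame(mark_id: bytes) -> str:
    """Assumed layout: ZWJ, one zero-width char per bit, ZWJ."""
    bits = "".join(f"{byte:08b}" for byte in mark_id)
    return ZWJ + "".join(ZWNJ if b == "1" else ZWSP for b in bits) + ZWJ


def embed_zw_sketch(text: str, mark_id: bytes, density: int = 40) -> str:
    """Insert one frame before every `density`-character chunk of text."""
    frame = make_frame(mark_id)
    chunks = [text[i:i + density] for i in range(0, len(text), density)]
    return "".join(frame + chunk for chunk in chunks)


def extract_zw_sketch(text: str, mark_len_bytes: int = 8) -> list[bytes]:
    """Return every complete frame found, duplicates included."""
    n_bits = mark_len_bytes * 8
    pattern = re.compile(f"{ZWJ}([{ZWSP}{ZWNJ}]{{{n_bits}}}){ZWJ}")
    marks = []
    for match in pattern.finditer(text):
        bits = "".join("1" if ch == ZWNJ else "0" for ch in match.group(1))
        marks.append(int(bits, 2).to_bytes(mark_len_bytes, "big"))
    return marks


mark = bytes.fromhex("0123456789abcdef")
marked = embed_zw_sketch("a" * 100, mark, density=40)
recovered = extract_zw_sketch(marked)  # three identical copies of mark
```

Returning duplicates rather than a set lets a caller count surviving frames as a rough confidence signal.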

embed_ws() / extract_ws()

L2 trailing-whitespace encoding. embed_ws(text, mark_id) appends a trailing space (bit 0) or tab (bit 1) to lines that have no existing trailing whitespace. extract_ws(text, mark_len_bytes) reads the mark back, returning Optional[bytes] (None if insufficient lines).
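The L2 scheme can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: bits are taken most significant first, one bit is written per eligible line, and extraction treats every line ending in a space or tab as a carrier (so pre-existing trailing whitespace in the input would be misread). The _sketch names are hypothetical.

```python
from typing import Optional


def embed_ws_sketch(text: str, mark_id: bytes) -> str:
    """Append a space (bit 0) or tab (bit 1) to lines with no trailing whitespace."""
    bits = [(byte >> shift) & 1 for byte in mark_id for shift in range(7, -1, -1)]
    out, i = [], 0
    for line in text.split("\n"):
        if i < len(bits) and line == line.rstrip():
            line += "\t" if bits[i] else " "
            i += 1
        out.append(line)
    return "\n".join(out)


def extract_ws_sketch(text: str, mark_len_bytes: int = 8) -> Optional[bytes]:
    """Read one bit per carrier line; return None if too few lines carry a bit."""
    needed = mark_len_bytes * 8
    bits = []
    for line in text.split("\n"):
        if line.endswith(" "):
            bits.append(0)
        elif line.endswith("\t"):
            bits.append(1)
        if len(bits) == needed:
            break
    if len(bits) < needed:
        return None
    return int("".join(map(str, bits)), 2).to_bytes(mark_len_bytes, "big")


body = "\n".join(f"line {n}" for n in range(20))
mark = b"\xa5\x3c"
round_trip = extract_ws_sketch(embed_ws_sketch(body, mark), mark_len_bytes=2)
```

A 16-bit mark needs at least 16 eligible lines, which is why short documents yield None.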

extract_ws_partial()

Parameter	Type	Description
`text`	`str`	Text with potential L2 marks.
`mark_len_bytes`	`int` (default 8)	Expected mark ID length.

Returns: tuple[Optional[bytes], float, int, int] containing (best_candidate, confidence, bits_recovered, bits_needed). Confidence is the ratio of recovered bits to needed bits. Unknown bits are zero-padded. Partial candidates with confidence ≥ 0.5 are included in fusion scoring.

apply_all()

Parameter	Type	Description
`text`	`str`	Input text.
`mark_id`	`bytes`	Per-recipient mark identifier (shared across all layers).

Returns: str with all available layers applied. Layer ordering is L3 (semantic) first, then L2 (whitespace), then L1 (zero-width). This ordering prevents L1's invisible characters from fragmenting L3's synonym words.

recover_marks()

Parameter	Type	Description
`text`	`str`	Leaked text to analyze.
`mark_len_bytes`	`int` (default 8)	Expected mark ID length.

Returns: dict with keys L1_zero_width, L2_whitespace, L3_synonyms, each containing a list of candidate mark bytes. The L3_synonyms list is always empty, because L3 requires candidate-based verification (see verify_l3).

recover_marks_v2()

Parameter	Type	Description
`text`	`str`	Leaked text.
`candidate_mark_ids`	`list[bytes] \| None`	Known mark IDs from registry for L3 verification. If None, L3 is skipped.
`mark_len_bytes`	`int` (default 8)	Expected mark ID length.

Returns: dict with keys layers (per-layer results and confidence), candidates (fused ranked list of (mark_id, combined_score, evidence_summary) tuples), and diagnostics (human-readable status strings per layer). Fusion uses independence-assumption score combination: combined = 1 - (1-s1)(1-s2)...(1-sN).
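The fusion formula can be tried out in isolation. The sketch below implements only the score combination stated above; fuse_scores and the evidence dict are hypothetical names, and the per-layer scores are made-up inputs.

```python
import math


def fuse_scores(scores: list[float]) -> float:
    """Combine per-layer scores s1..sN as 1 - (1-s1)(1-s2)...(1-sN)."""
    return 1.0 - math.prod(1.0 - s for s in scores)


# Hypothetical per-layer evidence for two candidate mark IDs.
evidence = {
    b"\x01": [0.9, 0.5],  # supported by two layers
    b"\x02": [0.6],       # supported by one layer
}
ranked = sorted(
    ((mark_id, fuse_scores(scores)) for mark_id, scores in evidence.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

Under this rule a candidate supported by several layers always scores at least as high as its best single layer, and a layer with score 0 leaves the combined score unchanged.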

verify_l3()

Parameter	Type	Description
`text`	`str`	Text to test.
`candidate_mark_ids`	`list[bytes]`	Mark IDs to test against semantic marks in the text.
`threshold`	`float` (default 0.70)	Minimum weighted score for a match.

Returns: list[tuple[bytes, float, dict]] of (mark_id, score, detail_dict) for candidates above the threshold, sorted by score descending. Delegates to semantic.verify_semantic().

semantic

The semantic module implements L3 watermarking: synonym-class rotation, punctuation-style fingerprinting, spelling variants, contraction choices, and number formatting. These marks survive format conversion, invisible-character stripping, and OCR because the signal is encoded in the words and punctuation themselves, not in formatting metadata.

apply_semantic()

Parameter	Type	Description
`text`	`str`	Input text.
`mark_id`	`bytes`	Per-recipient mark identifier.
`use_v2`	`bool` (default True)	Use the expanded 151-class v2 synonym dictionary. Falls back to the 27-class v1 table if False or if the v2 module is unavailable.

Returns: str with all L3 sublayers applied (synonyms, punctuation, spelling, contractions, number formatting). This is the primary L3 embedding entry point.

verify_semantic()

Parameter	Type	Description
`text`	`str`	Text to verify.
`candidate_mark_id`	`bytes`	The mark ID to test.
`use_v2`	`bool` (default True)	Use v2 dictionary for synonym verification.

Returns: dict with per-sublayer scores (synonyms_score, punctuation_score, spelling_score, contraction_score), per-sublayer hit counts (e.g., "2/3"), a combined weighted_score (weights: synonyms 0.50, punctuation 0.10, spelling 0.20, contractions 0.20), and an overall_match boolean (True if weighted_score ≥ 0.65).
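As a worked example of the weighting, the sketch below recomputes a weighted score from per-sublayer scores using the weights and the 0.65 cutoff given above. The function name and the treatment of a missing sublayer as 0.0 are assumptions for illustration.

```python
WEIGHTS = {
    "synonyms": 0.50,
    "punctuation": 0.10,
    "spelling": 0.20,
    "contractions": 0.20,
}
MATCH_THRESHOLD = 0.65


def weighted_score(sublayer_scores: dict[str, float]) -> float:
    """Weighted sum of sublayer scores; a missing sublayer counts as 0.0 (assumption)."""
    return sum(WEIGHTS[name] * sublayer_scores.get(name, 0.0) for name in WEIGHTS)


example = {"synonyms": 0.8, "punctuation": 1.0, "spelling": 0.5, "contractions": 0.5}
score = weighted_score(example)           # 0.4 + 0.1 + 0.1 + 0.1
overall_match = score >= MATCH_THRESHOLD
```

Because synonyms carry half the weight, a text with no synonym evidence needs perfect agreement on every other sublayer and still tops out at 0.50, below the cutoff.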

embed_synonyms_v2()

Parameter	Type	Description
`text`	`str`	Input text.
`mark_id`	`bytes`	Mark identifier.
`min_instances`	`int` (default 8)	Minimum synonym-class hits required. If the text has fewer, no embedding occurs and the text is returned unchanged.

Returns: str with synonym words replaced according to a deterministic variant sequence derived from mark_id via SHA-256 expansion. Skips URLs, email addresses, file paths, code blocks, hex strings, and base64 content. Falls back to the v1 27-class table if the v2 dictionary is unavailable.
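One way to derive a deterministic variant sequence from a mark ID is to expand it with SHA-256 in counter mode and reduce each output byte to a variant index. The sketch below shows that idea only; the counter construction, the one-byte-per-choice mapping, and the name variant_sequence are assumptions, and the package's actual expansion may be different.

```python
import hashlib


def variant_sequence(mark_id: bytes, n: int, class_size: int = 3) -> list[int]:
    """Return n variant indices in range(class_size), fully determined by mark_id."""
    stream = b""
    counter = 0
    while len(stream) < n:
        # Each block is SHA-256(mark_id || 4-byte big-endian counter).
        stream += hashlib.sha256(mark_id + counter.to_bytes(4, "big")).digest()
        counter += 1
    # byte % class_size has a slight bias when class_size does not divide 256.
    return [byte % class_size for byte in stream[:n]]


seq = variant_sequence(b"\x00" * 8, 40)
```

Determinism is the property that matters for verification: the verifier regenerates the same sequence from a candidate mark ID and compares it with the variants observed in the text.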

embed_spelling()

Parameter	Type	Description
`text`	`str`	Input text.
`mark_id`	`bytes`	Mark identifier. Bits 8-32 (offset from punctuation bits) select American vs British spelling for each of 25 variant pairs.

Returns: str with spelling variants applied (e.g., "color"/"colour", "organize"/"organise"). Case-preserving substitution.

embed_contractions()

Parameter	Type	Description
`text`	`str`	Input text.
`mark_id`	`bytes`	Mark identifier. Bits 40+ select contracted vs expanded form for each of 30 contraction pairs.

Returns: str with contractions expanded or collapsed per the mark_id (e.g., "don't"/"do not", "it's"/"it is").

embed_punctuation() / extract_punctuation_bits()

embed_punctuation(text, mark_id) applies three deterministic punctuation choices: bit 0 controls the Oxford comma, bit 1 selects em dash vs double-hyphen, and bit 2 selects curly vs straight quotes. Idempotent. extract_punctuation_bits(text) returns list[int] with the detected bit values (up to 3 bits), based on which style dominates in the text.

embed_number_format()

embed_number_format(text, mark_id) applies two number formatting choices: bit 72 controls comma separators in numbers ≥ 1000 ("1,000" vs "1000"), and bit 73 controls percent symbol vs word form ("50%" vs "50 percent").

fingerprint

The fingerprint module provides server-side content identification for leak detection when all embedded watermarks have been stripped. Fingerprints are computed at seal time and stored alongside the manifest or in the registry. They never appear in the document itself, so an adversary cannot strip what is not embedded.

ContentFingerprint

The primary class combining winnowing and sentence fingerprints for a document.

Attribute	Type	Description
`winnowing_fp`	`list[int]`	Sorted list of selected winnowing hash values.
`sentence_fp`	`list[str]`	List of 16-char hex hashes, one per sentence (order-independent within each sentence).
`text_length`	`int`	Original text length in characters.
`sentence_count`	`int`	Number of detected sentences.

ContentFingerprint.from_text()

Parameter	Type	Description
`text`	`str`	Source text.
`k`	`int` (default 10)	K-gram size for winnowing (character-level).
`window`	`int` (default 4)	Winnowing window size.

Returns: ContentFingerprint instance.

ContentFingerprint.similarity()

Parameter	Type	Description
`other`	`ContentFingerprint`	The fingerprint to compare against.

Returns: dict with keys winnowing (Jaccard similarity), sentence (set-overlap fraction), combined (0.4 * winnowing + 0.6 * sentence), and verdict: MATCH if combined ≥ 0.6, LIKELY if ≥ 0.3, UNLIKELY if ≥ 0.1, and NO_MATCH otherwise.
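The combined score and verdict mapping can be reproduced with a few lines. This sketch assumes the thresholds apply to the combined value; combined_verdict is a hypothetical name.

```python
def combined_verdict(winnowing: float, sentence: float) -> tuple[float, str]:
    """Blend the two similarities (0.4 / 0.6) and map the result to a verdict."""
    combined = 0.4 * winnowing + 0.6 * sentence
    if combined >= 0.6:
        verdict = "MATCH"
    elif combined >= 0.3:
        verdict = "LIKELY"
    elif combined >= 0.1:
        verdict = "UNLIKELY"
    else:
        verdict = "NO_MATCH"
    return combined, verdict


score, verdict = combined_verdict(winnowing=0.5, sentence=0.25)  # 0.2 + 0.15
```

The heavier sentence weight means a leak that preserves whole sentences but reflows the text can still reach MATCH on sentence overlap alone.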

ContentFingerprint.to_dict() / from_dict()

Serialization and deserialization for storage in the manifest or registry. to_dict() returns a plain dict; from_dict(d) is a classmethod that reconstructs the fingerprint from a stored dict.

winnow()

Parameter	Type	Description
`text`	`str`	Input text (will be normalized: lowercased, whitespace collapsed, non-alphanumeric stripped).
`k`	`int` (default 10)	K-gram size.
`window`	`int` (default 4)	Winnowing window size. Smaller windows produce more hashes (higher recall, lower precision).

Returns: list[int], sorted list of selected hash values. Each k-gram is hashed with MD5, truncated to 32 bits.
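For readers unfamiliar with winnowing, the following sketch shows the general technique with the parameters described above: character k-grams over normalized text, MD5 truncated to 32 bits, and the minimum hash kept from each sliding window. Details such as deduplicating the selected hashes and removing whitespace entirely during normalization are assumptions, and winnow_sketch is a hypothetical name.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything that is not a letter or digit (assumption)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def kgram_hash(kgram: str) -> int:
    """First 4 bytes of the MD5 digest, read as a 32-bit integer."""
    return int.from_bytes(hashlib.md5(kgram.encode("utf-8")).digest()[:4], "big")


def winnow_sketch(text: str, k: int = 10, window: int = 4) -> list[int]:
    norm = normalize(text)
    hashes = [kgram_hash(norm[i:i + k]) for i in range(len(norm) - k + 1)]
    if not hashes:
        return []  # text shorter than one k-gram
    selected = set()
    for start in range(max(1, len(hashes) - window + 1)):
        # Keep the smallest hash in each window of consecutive k-grams.
        selected.add(min(hashes[start:start + window]))
    return sorted(selected)


fp_a = winnow_sketch("The quick brown fox jumps over the lazy dog.")
fp_b = winnow_sketch("the QUICK brown fox, jumps over the lazy dog!")
```

Because normalization removes case and punctuation, fp_a and fp_b are identical even though the input strings differ.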

sentence_fingerprint()

Parameter	Type	Description
`text`	`str`	Input text.

Returns: list[str] of 16-char hex hashes (SHA-256, truncated). Each hash represents a sentence's sorted content words (words with > 2 characters). Sentences with fewer than 3 words are skipped.
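A sentence fingerprint of this kind can be sketched as below. The sentence splitter, the word tokenizer, and the choice to apply the three-word minimum after filtering short words are all assumptions for this example; sentence_fingerprint_sketch is a hypothetical name.

```python
import hashlib
import re


def sentence_fingerprint_sketch(text: str) -> list[str]:
    hashes = []
    # Naive split on sentence-ending punctuation (assumption).
    for sentence in re.split(r"[.!?]+", text):
        words = sorted(
            w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if len(w) > 2
        )
        if len(words) < 3:
            continue  # too short to be distinctive
        digest = hashlib.sha256(" ".join(words).encode("utf-8")).hexdigest()
        hashes.append(digest[:16])
    return hashes


original = sentence_fingerprint_sketch("The cat chased the mouse. Yes.")
reordered = sentence_fingerprint_sketch("The mouse chased the cat.")
```

Sorting the words before hashing is what makes the hash insensitive to word order, so original and reordered contain the same single hash.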

winnow_similarity() / sentence_similarity()

winnow_similarity(fp1, fp2) computes Jaccard similarity between two winnowing fingerprints (intersection over union of hash sets). sentence_similarity(fp1, fp2) computes the fraction of hashes in fp2 that appear in fp1.
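Both measures are simple set computations. A minimal sketch follows; the _sketch names are hypothetical, and returning 0.0 for empty inputs is an assumption about edge-case behavior.

```python
def winnow_similarity_sketch(fp1: list[int], fp2: list[int]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(fp1), set(fp2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def sentence_similarity_sketch(fp1: list[str], fp2: list[str]) -> float:
    """Fraction of the hashes in fp2 that also appear in fp1."""
    if not fp2:
        return 0.0
    known = set(fp1)
    return sum(1 for h in fp2 if h in known) / len(fp2)


jaccard = winnow_similarity_sketch([1, 2, 3], [2, 3, 4])           # 2 shared of 4 total
overlap = sentence_similarity_sketch(["a", "b", "c"], ["a", "x"])  # 1 of 2 found
```

Note that the sentence measure is asymmetric: it asks how much of fp2 (typically the leaked text) is covered by fp1 (the stored original), so a short excerpt of a long document can still score 1.0.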

ecc

The ecc module provides error-correcting codes for watermark bit protection. It implements a repetition code with majority-vote decoding: each payload bit is repeated R times, and decoding recovers the original bit by majority vote over each group. With R=7 (the default), up to 3 flipped bits per group are corrected, so a group still decodes correctly when up to 3 of its 7 bits (about 43%) are corrupted.

encode()

Parameter	Type	Description
`payload`	`bytes`	Raw bytes to protect (typically 8-byte mark_id).
`repetitions`	`int` (default 7)	Odd number of times each bit is repeated. Higher values increase error tolerance at the cost of bandwidth.

Returns: list[int] of coded bits. Length = len(payload) * 8 * repetitions.

decode()

Parameter	Type	Description
`coded_bits`	`list[int]`	Received bits (may contain errors). Padded or truncated to expected length.
`payload_len`	`int` (default 8)	Expected payload length in bytes.
`repetitions`	`int` (default 7)	Repetition factor used during encoding.

Returns: tuple[bytes, float, int] containing (recovered_payload, confidence, errors_corrected). Confidence is the fraction of groups where the majority vote was unanimous (all bits agreed).
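The repetition code with majority-vote decoding can be sketched in a few lines. This is an illustration of the technique, not the module's code: bit order (most significant first), zero-padding of short inputs, and counting minority bits as errors_corrected are assumptions, and payload_len is assumed to be at least 1.

```python
def encode_sketch(payload: bytes, repetitions: int = 7) -> list[int]:
    """Repeat every payload bit `repetitions` times."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    return [bit for bit in bits for _ in range(repetitions)]


def decode_sketch(coded_bits: list[int], payload_len: int = 8,
                  repetitions: int = 7) -> tuple[bytes, float, int]:
    expected = payload_len * 8 * repetitions
    # Pad with zeros or truncate to the expected length (padding value is an assumption).
    padded = (list(coded_bits) + [0] * expected)[:expected]
    out_bits, unanimous, corrected = [], 0, 0
    for i in range(0, expected, repetitions):
        ones = sum(padded[i:i + repetitions])
        bit = 1 if ones * 2 > repetitions else 0
        minority = repetitions - ones if bit else ones
        corrected += minority          # bits that disagreed with the vote
        unanimous += minority == 0     # groups where all bits agreed
        out_bits.append(bit)
    payload = int("".join(map(str, out_bits)), 2).to_bytes(payload_len, "big")
    return payload, unanimous / len(out_bits), corrected


coded = encode_sketch(b"\xa5")
for i in (0, 1, 2):
    coded[i] ^= 1  # flip 3 of the 7 copies of the first bit
payload, confidence, errors = decode_sketch(coded, payload_len=1)
```

With three of seven copies flipped, the first group still votes 4 to 3 for the original bit, so the payload is recovered while the confidence drops to 7/8.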

verify_with_ecc()

Parameter	Type	Description
`observed_variant_indices`	`list[int]`	Synonym variant indices observed in the text.
`candidate_mark_id`	`bytes`	The mark ID to verify.
`class_size`	`int` (default 3)	Number of variants per synonym class.
`repetitions`	`int` (default 3)	ECC repetition factor.

Returns: tuple[bool, float, bytes] containing (match, confidence, decoded_mark_id). Compares the expected variant sequence against observed choices, then either decodes via ECC (if enough bits are available) or falls back to simple ratio matching with a 0.70 threshold.

manifest

The manifest module defines the signed metadata that binds a sealed file to its recipient, watermarks, beacons, and policy. The manifest is serialized as canonical JSON (sorted keys, no optional whitespace, null fields omitted) and signed with Ed25519.

Manifest

A dataclass with the following fields:

Field	Type	Description
`file_id`	`str`	UUID4 identifier for the sealed file.
`issued_at`	`int`	Unix timestamp (seconds) when the manifest was created.
`version`	`str`	Protocol version string, e.g., `"OVERSIGHT-v1"`.
`suite`	`str`	Algorithm suite identifier (`"OSGT-CLASSIC-v1"` or `"OSGT-HYBRID-v1"`).
`original_filename`	`str`	Original file name at seal time.
`content_hash`	`str`	SHA-256 hex digest of the plaintext.
`content_type`	`str`	MIME type (default `"application/octet-stream"`).
`size_bytes`	`int`	Plaintext size in bytes.
`issuer_id`	`str`	Stable identifier for the issuer.
`issuer_ed25519_pub`	`str`	Issuer's Ed25519 public key (hex).
`recipient`	`Optional[Recipient]`	Recipient binding.
`watermarks`	`list[WatermarkRef]`	Per-recipient watermark references.
`beacons`	`list[dict]`	Beacon token descriptors.
`policy`	`dict`	Policy constraints: `not_after`, `max_opens`, `jurisdiction`, `registry_url`.
`canonical_content_hash`	`str`	Added in v0.4.5. SHA-256 hex digest of the pre-watermark source bytes. Provides a dispute anchor when L3 produces a recipient copy that is textually non-identical to the canonical source.
`l3_policy`	`dict`	Added in v0.4.5. Records the L3 safety decision at seal time: `mode` (`off`, `full`, or `boilerplate`), `document_class`, and `ack` (whether the non-identity acknowledgement was given).
`signature_ed25519`	`str`	Ed25519 signature over canonical bytes (hex). Filled by `sign()`.
`signature_ml_dsa`	`str`	Reserved for post-quantum ML-DSA-65 signature.

Manifest.new()

Class method that constructs a new Manifest with a fresh UUID4 file_id and current timestamp. Accepts original_filename, content_hash, size_bytes, issuer_id, issuer_ed25519_pub_hex, recipient (Recipient), registry_url, and optional content_type, not_after, max_opens, jurisdiction.

Manifest.sign() / Manifest.verify()

sign(issuer_ed25519_priv: bytes) computes the canonical bytes (excluding signature fields), signs with Ed25519, and stores the hex signature in signature_ed25519. verify() -> bool checks the stored signature against the issuer's public key and the canonical bytes.

Manifest.canonical_bytes() / to_json() / from_json()

canonical_bytes() returns the UTF-8 canonical JSON used as the signing input (signature fields set to empty string, null values stripped, keys sorted). to_json() returns the full manifest including signatures. from_json(data: bytes) deserializes from JSON bytes, reconstructing nested Recipient and WatermarkRef objects.
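The canonicalization rules (signature fields blanked, nulls dropped, keys sorted, no optional whitespace) map directly onto the standard json module. The sketch below operates on a plain dict rather than a Manifest instance; canonical_bytes_sketch and strip_nulls are hypothetical names, and ensure_ascii=False is an assumption about how non-ASCII text is emitted.

```python
import json

SIGNATURE_FIELDS = ("signature_ed25519", "signature_ml_dsa")


def strip_nulls(value):
    """Recursively drop dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value


def canonical_bytes_sketch(manifest: dict) -> bytes:
    body = dict(manifest)
    for field in SIGNATURE_FIELDS:
        body[field] = ""  # signatures never feed into their own signing input
    return json.dumps(
        strip_nulls(body), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


signed = {"file_id": "abc", "size_bytes": 3, "recipient": None, "signature_ed25519": "ff"}
unsigned = {"signature_ed25519": "", "size_bytes": 3, "file_id": "abc"}
same = canonical_bytes_sketch(signed) == canonical_bytes_sketch(unsigned)
```

The signed and unsigned forms of the same manifest produce identical bytes, which is what allows verify() to recompute the signing input after the signature has been stored.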

Recipient

Field	Type	Description
`recipient_id`	`str`	Stable identifier (email hash, user UUID, etc.).
`x25519_pub`	`str`	Recipient's X25519 public key (hex).
`ed25519_pub`	`Optional[str]`	Recipient's Ed25519 public key (hex), for verifying recipient acknowledgments.

WatermarkRef

Field	Type	Description
`layer`	`str`	Layer identifier: `"L1_zero_width"`, `"L2_whitespace"`, or `"L3_semantic"`.
`mark_id`	`str`	Hex-encoded per-recipient mark identifier.

crypto

The crypto module wraps vetted cryptographic primitives with no custom constructions. It uses the cryptography library (OpenSSL backend) for classical operations and oqs-python for post-quantum hooks (ML-KEM-768, ML-DSA-65) when available.

ClassicIdentity

A dataclass holding an X25519 keypair (encryption) and an Ed25519 keypair (signing), each as 32-byte raw values. ClassicIdentity.generate() creates a new random identity. public_bundle() returns a dict with hex-encoded public keys suitable for distribution to issuers.

content_hash()

Parameter	Type	Description
`data`	`bytes`	Raw content to hash.

Returns: str, the SHA-256 hex digest. Used for the manifest's content_hash field and as AEAD additional data.

random_dek()

Returns: bytes (32), a cryptographically random document encryption key for XChaCha20-Poly1305.

aead_encrypt() / aead_decrypt()

aead_encrypt(key, plaintext, aad) encrypts with XChaCha20-Poly1305 using a random 24-byte nonce, returning tuple[bytes, bytes] (nonce, ciphertext with tag). aead_decrypt(key, nonce, ciphertext, aad) decrypts and verifies the authentication tag. Raises an exception on tag mismatch.

wrap_dek_for_recipient() / unwrap_dek()

ECIES-style DEK wrapping. wrap_dek_for_recipient(dek, recipient_x25519_pub) generates an ephemeral X25519 keypair, performs key agreement, derives a wrapping key via HKDF-SHA256 with info string b"oversight-v1-dek-wrap", and encrypts the DEK with XChaCha20-Poly1305. Returns a dict with hex-encoded ephemeral_pub, nonce, and wrapped_dek. unwrap_dek(wrapped, recipient_x25519_priv) reverses the operation.
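The key-derivation step can be illustrated with a standard-library HKDF-SHA256 (RFC 5869). This sketch covers only the derivation: the X25519 agreement and the XChaCha20-Poly1305 encryption are omitted, the shared secret is a fixed placeholder, and the empty salt and the use of the raw shared secret as input keying material are assumptions. Only the info string comes from the description above.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF extract-then-expand per RFC 5869, using HMAC-SHA256."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                            # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


# Placeholder standing in for the X25519 shared secret (not a real key agreement).
shared_secret = bytes(range(32))
wrapping_key = hkdf_sha256(shared_secret, info=b"oversight-v1-dek-wrap")
```

Binding the derivation to a protocol-specific info string keeps the wrapping key distinct from any key derived from the same shared secret for another purpose, such as the hybrid wrap with its own info string.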

sign_manifest() / verify_manifest()

sign_manifest(manifest_bytes, ed25519_priv) returns the Ed25519 signature (bytes). verify_manifest(manifest_bytes, signature, ed25519_pub) returns bool.

Post-Quantum Functions

Available when oqs-python is installed (PQ_AVAILABLE = True): pq_kem_keypair(), pq_kem_encap(), pq_kem_decap() for ML-KEM-768, and pq_sig_keypair(), pq_sign(), pq_verify() for ML-DSA-65. hybrid_wrap_dek() combines X25519 and ML-KEM-768 shared secrets (X-wing-style) via HKDF with info string b"oversight-hybrid-v1-dek-wrap".

This reference documents the Python API as of v0.4.5, including the canonical_content_hash and l3_policy manifest fields and the l3_policy module added in that release. The Rust implementation provides equivalent functionality with native types. Consult the repository for the latest interfaces.
