API Reference
Python API documentation for oversight_core v0.4.5
Zion Boggan · April 2026 · Oversight Protocol v0.4.5
The oversight_core package exposes the full seal/open pipeline, watermark
embedding and recovery, semantic marking, content fingerprinting, error-correcting codes,
manifest construction, and cryptographic primitives. All public symbols are importable
from the top-level package or from their respective submodules. This reference covers
the Python implementation; the Rust API mirrors these interfaces but uses native types
and ownership semantics.
Contents
container (seal, open_sealed) · watermark (L1/L2 marks, fusion) · semantic (L3 marks) · fingerprint (content identification) · ecc (error correction) · manifest (metadata binding) · crypto (primitives)
container
The container module implements the binary .sealed format
and exposes the two primary entry points for the protocol: seal() and
open_sealed(). It also provides the SealedFile dataclass
for low-level access to the container fields.
seal()
| Parameter | Type | Description |
|---|---|---|
plaintext | bytes | Raw content to encrypt. Watermarking, if desired, must be applied before calling seal. |
manifest | Manifest | Pre-populated manifest. content_hash must equal sha256(plaintext) and size_bytes must equal len(plaintext). |
issuer_ed25519_priv | bytes (32) | Issuer's Ed25519 private key seed for signing the manifest. |
recipient_x25519_pub | bytes (32) | Recipient's X25519 public key. Must match manifest.recipient.x25519_pub. |
Returns: bytes containing the complete .sealed binary blob.
The function signs the manifest with Ed25519, generates a random 256-bit DEK, wraps
the DEK via ECIES (X25519 + HKDF-SHA256 + XChaCha20-Poly1305), encrypts the plaintext
with XChaCha20-Poly1305 using the manifest's content hash as AAD, and assembles the
container. Raises ValueError if any precondition is violated.
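The preconditions listed in the parameter table can be checked before calling seal(). The following is a minimal sketch using only the standard library; it assumes a plain dict stands in for the Manifest dataclass, and check_seal_preconditions is a hypothetical helper, not part of oversight_core.

```python
import hashlib

def check_seal_preconditions(plaintext: bytes, manifest: dict,
                             recipient_x25519_pub: bytes) -> None:
    """Raise ValueError if the manifest does not describe this plaintext/recipient.

    Hypothetical helper: `manifest` is a plain dict mirroring the documented
    Manifest fields, not the real dataclass.
    """
    if manifest["content_hash"] != hashlib.sha256(plaintext).hexdigest():
        raise ValueError("content_hash does not match sha256(plaintext)")
    if manifest["size_bytes"] != len(plaintext):
        raise ValueError("size_bytes does not match len(plaintext)")
    if len(recipient_x25519_pub) != 32:
        raise ValueError("recipient_x25519_pub must be 32 bytes")
    if manifest["recipient"]["x25519_pub"] != recipient_x25519_pub.hex():
        raise ValueError("recipient key does not match manifest.recipient.x25519_pub")

plaintext = b"quarterly report"
recipient_pub = bytes(range(32))  # placeholder key material, not a real key
manifest = {
    "content_hash": hashlib.sha256(plaintext).hexdigest(),
    "size_bytes": len(plaintext),
    "recipient": {"x25519_pub": recipient_pub.hex()},
}
check_seal_preconditions(plaintext, manifest, recipient_pub)  # passes silently
```

Running the same checks on the caller side gives a clearer error message than waiting for seal() to reject the input.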
open_sealed()
| Parameter | Type | Description |
|---|---|---|
blob | bytes | The .sealed file contents. |
recipient_x25519_priv | bytes (32) | Recipient's X25519 private key for DEK unwrapping. |
trusted_issuer_pubs | Optional[set[str]] | If provided, the issuer's Ed25519 public key (hex) must be in this set. Rejects unknown issuers. |
policy_ctx | Optional[PolicyContext] | Runtime context for policy enforcement (clock, IP, open counter state). |
Returns: tuple[bytes, Manifest] containing the decrypted plaintext and the parsed, verified manifest.
Verification order is: parse container, verify Ed25519 signature, check trusted-issuer
list, enforce time/jurisdiction/max_opens policy, unwrap DEK, AEAD decrypt, and
post-decrypt SHA-256 content hash check. Raises ValueError on any
integrity failure and PolicyViolation if policy constraints are not met.
For multi-recipient containers, all wrapped DEK slots are tried until one succeeds.
SealedFile
A dataclass representing the parsed binary container. Fields: manifest
(Manifest), wrapped_dek (dict), aead_nonce (bytes, 24),
ciphertext (bytes), suite_id (int, default 1 for CLASSIC_V1).
Provides to_bytes() for serialization and from_bytes(data)
for parsing with length-field validation against DoS caps (4 MB manifest, 1 MB wrapped
DEK, 4 GB ciphertext).
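To illustrate what length-field validation against a cap looks like, here is a sketch of a bounded field reader. The 4-byte big-endian length prefix is an assumption made for the example; the actual .sealed layout is not specified in this reference, and read_capped_field is a hypothetical name.

```python
import struct

MAX_MANIFEST = 4 * 1024 * 1024   # 4 MB cap quoted above
MAX_WRAPPED_DEK = 1024 * 1024    # 1 MB cap quoted above

def read_capped_field(data: bytes, offset: int, cap: int) -> tuple[bytes, int]:
    """Read one length-prefixed field and return (field_bytes, next_offset).

    Assumed layout: 4-byte big-endian length, then that many bytes.
    The declared length is checked before any slicing happens.
    """
    if offset + 4 > len(data):
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack_from(">I", data, offset)
    if length > cap:
        raise ValueError(f"declared length {length} exceeds cap {cap}")
    start = offset + 4
    end = start + length
    if end > len(data):
        raise ValueError("declared length runs past end of buffer")
    return data[start:end], end

blob = struct.pack(">I", 5) + b"hello" + struct.pack(">I", 3) + b"abc"
manifest_bytes, pos = read_capped_field(blob, 0, MAX_MANIFEST)
dek_bytes, pos = read_capped_field(blob, pos, MAX_WRAPPED_DEK)
```

Checking the declared length before allocating or slicing is what keeps a hostile header from forcing a large allocation.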
watermark
The watermark module handles L1 (zero-width Unicode) and L2 (trailing
whitespace) watermark embedding and extraction, as well as high-level multi-layer
application, recovery, and fusion. L3 semantic marks are delegated to the
semantic module but accessible through apply_all() and
verify_l3() here.
new_mark_id()
| Parameter | Type | Description |
|---|---|---|
n_bytes | int (default 8) | Length of the mark ID in bytes. 8 bytes (64 bits) is the default. |
Returns: bytes, a cryptographically random per-recipient mark identifier.
embed_zw()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text to watermark. |
mark_id | bytes | Per-recipient mark identifier. |
density | int (default 40) | Approximate character interval between redundant mark frames. Lower values insert more copies. |
Returns: str with zero-width Unicode frames (ZWSP/ZWNJ/ZWJ) inserted at regular intervals.
extract_zw()
| Parameter | Type | Description |
|---|---|---|
text | str | Potentially watermarked text. |
mark_len_bytes | int (default 8) | Expected mark ID length in bytes. |
Returns: list[bytes], all recovered mark IDs from zero-width frames. May contain duplicates if multiple frames survived.
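The idea behind embed_zw() and extract_zw() can be sketched with the standard library. The frame layout below (ZWJ as delimiter, ZWSP for bit 0, ZWNJ for bit 1, most significant bit first) is an assumption chosen for illustration; the library's actual frame format is not described here, and the function names are hypothetical.

```python
import re

ZWSP, ZWNJ, ZWJ = "\u200b", "\u200c", "\u200d"

def make_frame(mark_id: bytes) -> str:
    """Encode mark_id as ZWJ + one zero-width char per bit + ZWJ (assumed layout)."""
    bits = "".join(f"{byte:08b}" for byte in mark_id)
    return ZWJ + "".join(ZWNJ if bit == "1" else ZWSP for bit in bits) + ZWJ

def embed_zw_sketch(text: str, mark_id: bytes, density: int = 40) -> str:
    """Insert one frame roughly every `density` visible characters."""
    frame = make_frame(mark_id)
    chunks = [text[i:i + density] for i in range(0, len(text), density)]
    return frame.join(chunks) if len(chunks) > 1 else text + frame

def extract_zw_sketch(text: str, mark_len_bytes: int = 8) -> list[bytes]:
    """Return every complete frame found, duplicates included."""
    n_bits = mark_len_bytes * 8
    pattern = re.compile(f"{ZWJ}([{ZWSP}{ZWNJ}]{{{n_bits}}}){ZWJ}")
    found = []
    for match in pattern.finditer(text):
        bits = "".join("1" if ch == ZWNJ else "0" for ch in match.group(1))
        found.append(int(bits, 2).to_bytes(mark_len_bytes, "big"))
    return found

sample = "a" * 100
mark = bytes.fromhex("0123456789abcdef")
marked = embed_zw_sketch(sample, mark)
recovered = extract_zw_sketch(marked)
```

Because every frame carries the full mark ID, a single surviving frame is enough to identify the recipient, which is why the extractor returns a list that may contain duplicates.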
embed_ws() / extract_ws()
L2 trailing-whitespace encoding. embed_ws(text, mark_id) appends a trailing
space (bit 0) or tab (bit 1) to lines that have no existing trailing whitespace.
extract_ws(text, mark_len_bytes) reads the mark back, returning
Optional[bytes] (None if insufficient lines).
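A minimal sketch of the L2 scheme described above, assuming one bit per line, most significant bit first, and that empty lines are skipped. Those three details and the function names are assumptions for illustration, not the library's documented behavior.

```python
from typing import Optional

def embed_ws_sketch(text: str, mark_id: bytes) -> str:
    """Append a space (bit 0) or tab (bit 1) to lines with no trailing whitespace."""
    bits = [(byte >> (7 - i)) & 1 for byte in mark_id for i in range(8)]
    out, used = [], 0
    for line in text.split("\n"):
        # Only mark non-empty lines that do not already end in whitespace.
        if used < len(bits) and line and not line[-1].isspace():
            line += "\t" if bits[used] else " "
            used += 1
        out.append(line)
    return "\n".join(out)

def extract_ws_sketch(text: str, mark_len_bytes: int = 8) -> Optional[bytes]:
    """Read trailing space/tab bits back; None if too few marked lines."""
    bits = []
    for line in text.split("\n"):
        if line.endswith(" "):
            bits.append(0)
        elif line.endswith("\t"):
            bits.append(1)
    needed = mark_len_bytes * 8
    if len(bits) < needed:
        return None
    value = 0
    for bit in bits[:needed]:
        value = (value << 1) | bit
    return value.to_bytes(mark_len_bytes, "big")

document = "\n".join(f"line {i}" for i in range(20))
mark = b"\xa5\x3c"
marked = embed_ws_sketch(document, mark)
```

Note that this sketch treats any pre-existing trailing space or tab as a mark bit during extraction, so it only round-trips on input that starts out clean.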
extract_ws_partial()
| Parameter | Type | Description |
|---|---|---|
text | str | Text with potential L2 marks. |
mark_len_bytes | int (default 8) | Expected mark ID length. |
Returns: tuple[Optional[bytes], float, int, int] containing
(best_candidate, confidence, bits_recovered, bits_needed). Confidence is the ratio
of recovered bits to needed bits. Unknown bits are zero-padded. Partial candidates
with confidence ≥ 0.5 are included in fusion scoring.
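The confidence calculation and zero-padding described above can be sketched as follows. The bit order (most significant first) and the choice to return None when no bits are found are assumptions, and extract_ws_partial_sketch is a hypothetical name.

```python
from typing import Optional

def extract_ws_partial_sketch(
    text: str, mark_len_bytes: int = 8
) -> tuple[Optional[bytes], float, int, int]:
    """Return (best_candidate, confidence, bits_recovered, bits_needed)."""
    needed = mark_len_bytes * 8
    bits = []
    for line in text.split("\n"):
        if line.endswith(" "):
            bits.append(0)
        elif line.endswith("\t"):
            bits.append(1)
    bits = bits[:needed]
    recovered = len(bits)
    if recovered == 0:
        return None, 0.0, 0, needed
    # Unknown trailing bits are filled with zeros, as described above.
    padded = bits + [0] * (needed - recovered)
    value = int("".join(map(str, padded)), 2)
    return value.to_bytes(mark_len_bytes, "big"), recovered / needed, recovered, needed

# Twelve surviving lines carrying the first 12 bits of 0xA53C.
surviving_bits = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
leaked = "\n".join("x" + ("\t" if b else " ") for b in surviving_bits)
candidate, confidence, got, want = extract_ws_partial_sketch(leaked, mark_len_bytes=2)
```

Here 12 of 16 bits survive, so the candidate clears the 0.5 confidence bar for fusion even though its last four bits are padding.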
apply_all()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
mark_id | bytes | Per-recipient mark identifier (shared across all layers). |
Returns: str with all available layers applied.
Layer ordering is L3 (semantic) first, then L2 (whitespace), then L1 (zero-width).
This ordering prevents L1's invisible characters from fragmenting L3's synonym words.
recover_marks()
| Parameter | Type | Description |
|---|---|---|
text | str | Leaked text to analyze. |
mark_len_bytes | int (default 8) | Expected mark ID length. |
Returns: dict with keys L1_zero_width,
L2_whitespace, L3_synonyms, each containing a list of
candidate mark bytes. L3 returns empty because it requires candidate-based verification
(see verify_l3).
recover_marks_v2()
| Parameter | Type | Description |
|---|---|---|
text | str | Leaked text. |
candidate_mark_ids | Optional[list[bytes]] | Known mark IDs from registry for L3 verification. If None, L3 is skipped. |
mark_len_bytes | int (default 8) | Expected mark ID length. |
Returns: dict with keys layers (per-layer
results and confidence), candidates (fused ranked list of
(mark_id, combined_score, evidence_summary) tuples), and
diagnostics (human-readable status strings per layer).
Fusion uses independence-assumption score combination:
combined = 1 - (1-s1)(1-s2)...(1-sN).
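The combination formula is a noisy-OR over per-layer scores. A short sketch, with fuse_scores as a hypothetical helper name and made-up example scores:

```python
from math import prod

def fuse_scores(scores: list[float]) -> float:
    """Combine per-layer scores as 1 - (1-s1)(1-s2)...(1-sN)."""
    return 1.0 - prod(1.0 - s for s in scores)

# Illustrative per-candidate layer scores (not real output).
layer_scores = {
    bytes.fromhex("aa" * 8): [0.9, 0.5],  # seen in two layers
    bytes.fromhex("bb" * 8): [0.6],       # seen in one layer
}
ranked = sorted(
    ((mark_id, fuse_scores(scores)) for mark_id, scores in layer_scores.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

With scores 0.9 and 0.5 the combined value is 1 - (0.1)(0.5) = 0.95, so agreement between layers raises the score above either layer alone.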
verify_l3()
| Parameter | Type | Description |
|---|---|---|
text | str | Text to test. |
candidate_mark_ids | list[bytes] | Mark IDs to test against semantic marks in the text. |
threshold | float (default 0.70) | Minimum weighted score for a match. |
Returns: list[tuple[bytes, float, dict]] of
(mark_id, score, detail_dict) for candidates above the threshold, sorted by
score descending. Delegates to semantic.verify_semantic().
semantic
The semantic module implements L3 watermarking: synonym-class rotation,
punctuation-style fingerprinting, spelling variants, contraction choices, and number
formatting. These marks survive format conversion, invisible-character stripping,
and OCR because the signal is encoded in the words and punctuation themselves,
not in formatting metadata.
apply_semantic()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
mark_id | bytes | Per-recipient mark identifier. |
use_v2 | bool (default True) | Use the expanded 151-class v2 synonym dictionary. Falls back to the 27-class v1 table if False or if the v2 module is unavailable. |
Returns: str with all L3 sublayers applied (synonyms,
punctuation, spelling, contractions, number formatting). This is the primary L3
embedding entry point.
verify_semantic()
| Parameter | Type | Description |
|---|---|---|
text | str | Text to verify. |
candidate_mark_id | bytes | The mark ID to test. |
use_v2 | bool (default True) | Use v2 dictionary for synonym verification. |
Returns: dict with per-sublayer scores
(synonyms_score, punctuation_score,
spelling_score, contraction_score),
per-sublayer hit counts (e.g., "2/3"), a weighted
weighted_score (weights: synonyms 0.50, punctuation 0.10,
spelling 0.20, contractions 0.20), and an overall_match
boolean (True if weighted_score ≥ 0.65).
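The weighting can be reproduced by hand. This sketch uses the weights and threshold quoted above; the dict keys and function name are illustrative, and how the library treats a sublayer with no evidence is not covered here.

```python
WEIGHTS = {
    "synonyms": 0.50,
    "punctuation": 0.10,
    "spelling": 0.20,
    "contractions": 0.20,
}
MATCH_THRESHOLD = 0.65

def weighted_score(sublayer_scores: dict[str, float]) -> float:
    """Weighted sum of the four sublayer scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * sublayer_scores[name] for name in WEIGHTS)

# Illustrative sublayer scores (not real output).
example = {"synonyms": 0.8, "punctuation": 1.0, "spelling": 0.5, "contractions": 0.5}
score = weighted_score(example)          # 0.40 + 0.10 + 0.10 + 0.10
overall_match = score >= MATCH_THRESHOLD
```

Because synonyms carry half the weight, a strong synonym signal can reach the threshold even when spelling and contraction evidence is weak.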
embed_synonyms_v2()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
mark_id | bytes | Mark identifier. |
min_instances | int (default 8) | Minimum synonym-class hits required. If the text has fewer, no embedding occurs and the text is returned unchanged. |
Returns: str with synonym words replaced according to a
deterministic variant sequence derived from mark_id via
SHA-256 expansion. Skips URLs, email addresses, file paths, code blocks,
hex strings, and base64 content. Falls back to the v1 27-class table if
the v2 dictionary is unavailable.
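One way to derive a deterministic variant sequence from a mark ID by SHA-256 expansion is to hash the ID with a counter and reduce each output byte modulo the class size. This is a sketch of the general technique only: the counter encoding, the byte-to-index mapping, and the name variant_sequence are assumptions, and the library's actual derivation may differ.

```python
import hashlib

def variant_sequence(mark_id: bytes, n: int, class_size: int = 3) -> list[int]:
    """Expand mark_id into n variant indices in range(class_size).

    Assumed scheme: SHA-256(mark_id || 4-byte counter), one index per digest byte.
    """
    indices: list[int] = []
    counter = 0
    while len(indices) < n:
        block = hashlib.sha256(mark_id + counter.to_bytes(4, "big")).digest()
        indices.extend(byte % class_size for byte in block)
        counter += 1
    return indices[:n]

seq_a = variant_sequence(bytes.fromhex("0123456789abcdef"), 40)
seq_b = variant_sequence(bytes.fromhex("fedcba9876543210"), 40)
```

The n-th synonym-class hit in the text would then take variant seq_a[n]. Reducing a byte modulo 3 introduces a slight bias toward index 0, which a production scheme might avoid with rejection sampling.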
embed_spelling()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
mark_id | bytes | Mark identifier. Bits 8-32 (offset from punctuation bits) select American vs British spelling for each of 25 variant pairs. |
Returns: str with spelling variants applied
(e.g., "color"/"colour", "organize"/"organise"). Case-preserving substitution.
embed_contractions()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
mark_id | bytes | Mark identifier. Bits 40+ select contracted vs expanded form for each of 30 contraction pairs. |
Returns: str with contractions expanded or collapsed
per the mark_id (e.g., "don't"/"do not", "it's"/"it is").
embed_punctuation() / extract_punctuation_bits()
embed_punctuation(text, mark_id) applies three deterministic punctuation
choices: bit 0 controls the Oxford comma, bit 1 selects em dash vs double-hyphen,
and bit 2 selects curly vs straight quotes. Idempotent.
extract_punctuation_bits(text) returns list[int] with the
detected bit values (up to 3 bits), based on which style dominates in the text.
embed_number_format()
embed_number_format(text, mark_id) applies two number formatting choices:
bit 72 controls comma separators in numbers ≥ 1000 ("1,000" vs "1000"),
and bit 73 controls percent symbol vs word form ("50%" vs "50 percent").
fingerprint
The fingerprint module provides server-side content identification
for leak detection when all embedded watermarks have been stripped. Fingerprints
are computed at seal time and stored alongside the manifest or in the registry.
They never appear in the document itself, so an adversary cannot strip what is
not embedded.
ContentFingerprint
The primary class combining winnowing and sentence fingerprints for a document.
| Attribute | Type | Description |
|---|---|---|
winnowing_fp | list[int] | Sorted list of selected winnowing hash values. |
sentence_fp | list[str] | List of 16-char hex hashes, one per sentence (order-independent within each sentence). |
text_length | int | Original text length in characters. |
sentence_count | int | Number of detected sentences. |
ContentFingerprint.from_text()
| Parameter | Type | Description |
|---|---|---|
text | str | Source text. |
k | int (default 10) | K-gram size for winnowing (character-level). |
window | int (default 4) | Winnowing window size. |
Returns: ContentFingerprint instance.
ContentFingerprint.similarity()
| Parameter | Type | Description |
|---|---|---|
other | ContentFingerprint | The fingerprint to compare against. |
Returns: dict with keys winnowing (Jaccard
similarity), sentence (set-overlap fraction), combined
(0.4 * winnowing + 0.6 * sentence), and verdict (one of
MATCH if ≥ 0.6, LIKELY if ≥ 0.3,
UNLIKELY if ≥ 0.1, or NO_MATCH).
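The combined score and verdict bands can be written out directly from the thresholds above. The function name is hypothetical and the inputs are illustrative.

```python
def combine_and_judge(winnowing: float, sentence: float) -> tuple[float, str]:
    """Return (combined, verdict) using 0.4 * winnowing + 0.6 * sentence."""
    combined = 0.4 * winnowing + 0.6 * sentence
    if combined >= 0.6:
        verdict = "MATCH"
    elif combined >= 0.3:
        verdict = "LIKELY"
    elif combined >= 0.1:
        verdict = "UNLIKELY"
    else:
        verdict = "NO_MATCH"
    return combined, verdict

combined, verdict = combine_and_judge(winnowing=0.5, sentence=0.8)
```

With winnowing 0.5 and sentence 0.8 the combined score is 0.20 + 0.48 = 0.68, which falls in the MATCH band.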
ContentFingerprint.to_dict() / from_dict()
Serialization and deserialization for storage in the manifest or registry.
to_dict() returns a plain dict; from_dict(d) is a classmethod
that reconstructs the fingerprint from a stored dict.
winnow()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text (will be normalized: lowercased, whitespace collapsed, non-alphanumeric stripped). |
k | int (default 10) | K-gram size. |
window | int (default 4) | Winnowing window size. Smaller windows produce more hashes (higher recall, lower precision). |
Returns: list[int], sorted list of selected hash values. Each k-gram is hashed with MD5, truncated to 32 bits.
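A compact sketch of winnowing with the parameters above. The normalization here removes every non-alphanumeric character, whitespace included, and the minimum of each window is kept as a set of values; both are simplifying assumptions, and the function names are hypothetical.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and drop everything except a-z and 0-9 (assumed normalization)."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def kgram_hash(kgram: str) -> int:
    """MD5 of the k-gram, truncated to the first 32 bits."""
    return int.from_bytes(hashlib.md5(kgram.encode("utf-8")).digest()[:4], "big")

def winnow_sketch(text: str, k: int = 10, window: int = 4) -> list[int]:
    """Select the minimum hash from every window of consecutive k-gram hashes."""
    norm = normalize(text)
    hashes = [kgram_hash(norm[i:i + k]) for i in range(len(norm) - k + 1)]
    if not hashes:
        return []
    if len(hashes) < window:
        return [min(hashes)]
    selected = {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}
    return sorted(selected)

fp_original = winnow_sketch("The quick brown fox jumps over the lazy dog.")
fp_reformatted = winnow_sketch("the QUICK brown fox, jumps over\nthe lazy dog")
```

The two inputs differ only in case, punctuation, and line breaks, so they normalize to the same string and produce identical fingerprints.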
sentence_fingerprint()
| Parameter | Type | Description |
|---|---|---|
text | str | Input text. |
Returns: list[str] of 16-char hex hashes (SHA-256,
truncated). Each hash represents a sentence's sorted content words (words with > 2
characters). Sentences with fewer than 3 words are skipped.
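A sketch of the sentence fingerprint, assuming sentences are split on ., !, and ?, and that the three-word minimum is applied after filtering to content words. Both choices, the tokenizer, and the function name are assumptions.

```python
import hashlib
import re

def sentence_fingerprint_sketch(text: str) -> list[str]:
    """One 16-hex-char SHA-256 prefix per sentence, over its sorted content words."""
    fingerprints = []
    for sentence in re.split(r"[.!?]+", text):
        # Content words: lowercase alphanumeric tokens longer than 2 characters.
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 2]
        if len(words) < 3:
            continue
        digest = hashlib.sha256(" ".join(sorted(words)).encode("utf-8")).hexdigest()
        fingerprints.append(digest[:16])
    return fingerprints

fp_a = sentence_fingerprint_sketch("The cat chased the mouse. Hi there.")
fp_b = sentence_fingerprint_sketch("Chased the mouse, the cat!")
```

Sorting the words before hashing makes the hash insensitive to word order, so the reordered sentence in fp_b matches the original in fp_a, while the two-word sentence is skipped.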
winnow_similarity() / sentence_similarity()
winnow_similarity(fp1, fp2) computes Jaccard similarity between two
winnowing fingerprints (intersection over union of hash sets).
sentence_similarity(fp1, fp2) computes the fraction of hashes in
fp2 that appear in fp1.
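Both measures reduce to set operations. The sketch below follows the definitions above; returning 0.0 for empty input is an assumption, and the function names are hypothetical.

```python
def winnow_similarity_sketch(fp1: list[int], fp2: list[int]) -> float:
    """Jaccard similarity: |intersection| / |union| of the two hash sets."""
    a, b = set(fp1), set(fp2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def sentence_similarity_sketch(fp1: list[str], fp2: list[str]) -> float:
    """Fraction of fp2's hashes that also appear in fp1 (asymmetric)."""
    b = set(fp2)
    if not b:
        return 0.0
    return len(b & set(fp1)) / len(b)

jaccard = winnow_similarity_sketch([1, 2, 3], [2, 3, 4])           # 2 shared of 4 total
forward = sentence_similarity_sketch(["a", "b", "c"], ["b", "x"])  # 1 of 2
reverse = sentence_similarity_sketch(["b", "x"], ["a", "b", "c"])  # 1 of 3
```

The sentence measure depends on argument order: a short leaked excerpt passed as fp2 can score high against a long original passed as fp1, but not the other way around.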
ecc
The ecc module provides error-correcting codes for watermark bit
protection. It implements a repetition code with majority-vote decoding: each
payload bit is repeated R times, and decoding recovers the original bit by majority
vote over each group. With R=7 (the default), up to 3 flipped bits per group are corrected,
so a group decodes correctly whenever fewer than half of its 7 copies (about 43%) are corrupted.
encode()
| Parameter | Type | Description |
|---|---|---|
payload | bytes | Raw bytes to protect (typically 8-byte mark_id). |
repetitions | int (default 7) | Odd number of times each bit is repeated. Higher values increase error tolerance at the cost of bandwidth. |
Returns: list[int] of coded bits. Length = len(payload) * 8 * repetitions.
decode()
| Parameter | Type | Description |
|---|---|---|
coded_bits | list[int] | Received bits (may contain errors). Padded or truncated to expected length. |
payload_len | int (default 8) | Expected payload length in bytes. |
repetitions | int (default 7) | Repetition factor used during encoding. |
Returns: tuple[bytes, float, int] containing
(recovered_payload, confidence, errors_corrected). Confidence is the fraction of
groups where the majority vote was unanimous (all bits agreed).
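The repetition code and its majority-vote decoder can be sketched in a few lines. Bit order (most significant first) and the zero-padding of short input are assumptions, and the function names are hypothetical.

```python
def encode_sketch(payload: bytes, repetitions: int = 7) -> list[int]:
    """Repeat every payload bit `repetitions` times."""
    bits = [(byte >> (7 - i)) & 1 for byte in payload for i in range(8)]
    return [bit for bit in bits for _ in range(repetitions)]

def decode_sketch(coded_bits: list[int], payload_len: int = 8,
                  repetitions: int = 7) -> tuple[bytes, float, int]:
    """Majority-vote decode; returns (payload, confidence, errors_corrected)."""
    n_groups = payload_len * 8
    expected = n_groups * repetitions
    bits = (list(coded_bits) + [0] * expected)[:expected]  # pad or truncate
    decoded, unanimous, corrected = [], 0, 0
    for start in range(0, expected, repetitions):
        group = bits[start:start + repetitions]
        ones = sum(group)
        bit = 1 if ones * 2 > repetitions else 0
        minority = repetitions - ones if bit else ones  # bits outvoted in this group
        corrected += minority
        unanimous += minority == 0
        decoded.append(bit)
    value = int("".join(map(str, decoded)), 2)
    return value.to_bytes(payload_len, "big"), unanimous / n_groups, corrected

mark = bytes.fromhex("0123456789abcdef")
coded = encode_sketch(mark)
damaged = list(coded)
for start in range(0, len(damaged), 7):
    for j in range(3):            # flip 3 of the 7 copies in every group
        damaged[start + j] ^= 1
payload, confidence, corrected = decode_sketch(damaged)
```

Even with three flips in every group the payload is recovered, but the confidence drops to 0.0 because no group is unanimous. That separation is what lets callers distinguish a clean read from a heavily repaired one.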
verify_with_ecc()
| Parameter | Type | Description |
|---|---|---|
observed_variant_indices | list[int] | Synonym variant indices observed in the text. |
candidate_mark_id | bytes | The mark ID to verify. |
class_size | int (default 3) | Number of variants per synonym class. |
repetitions | int (default 3) | ECC repetition factor. |
Returns: tuple[bool, float, bytes] containing
(match, confidence, decoded_mark_id). Compares the expected variant sequence
against observed choices, then either decodes via ECC (if enough bits are
available) or falls back to simple ratio matching with a 0.70 threshold.
manifest
The manifest module defines the signed metadata that binds a sealed
file to its recipient, watermarks, beacons, and policy. The manifest is serialized
as canonical JSON (sorted keys, no optional whitespace, null fields omitted) and
signed with Ed25519.
Manifest
A dataclass with the following fields:
| Field | Type | Description |
|---|---|---|
file_id | str | UUID4 identifier for the sealed file. |
issued_at | int | Unix timestamp (seconds) when the manifest was created. |
version | str | Protocol version string, e.g., "OVERSIGHT-v1". |
suite | str | Algorithm suite identifier ("OSGT-CLASSIC-v1" or "OSGT-HYBRID-v1"). |
original_filename | str | Original file name at seal time. |
content_hash | str | SHA-256 hex digest of the plaintext. |
content_type | str | MIME type (default "application/octet-stream"). |
size_bytes | int | Plaintext size in bytes. |
issuer_id | str | Stable identifier for the issuer. |
issuer_ed25519_pub | str | Issuer's Ed25519 public key (hex). |
recipient | Optional[Recipient] | Recipient binding. |
watermarks | list[WatermarkRef] | Per-recipient watermark references. |
beacons | list[dict] | Beacon token descriptors. |
policy | dict | Policy constraints: not_after, max_opens, jurisdiction, registry_url. |
canonical_content_hash | str | Added in v0.4.5. SHA-256 hex digest of the pre-watermark source bytes. Provides a dispute anchor when L3 produces a recipient copy that is textually non-identical to the canonical source. |
l3_policy | dict | Added in v0.4.5. Records the L3 safety decision at seal time: mode (off, full, or boilerplate), document_class, and ack (whether the non-identity acknowledgement was given). |
signature_ed25519 | str | Ed25519 signature over canonical bytes (hex). Filled by sign(). |
signature_ml_dsa | str | Reserved for post-quantum ML-DSA-65 signature. |
Manifest.new()
Class method that constructs a new Manifest with a fresh UUID4 file_id and current
timestamp. Accepts original_filename, content_hash,
size_bytes, issuer_id, issuer_ed25519_pub_hex,
recipient (Recipient), registry_url, and optional
content_type, not_after, max_opens,
jurisdiction.
Manifest.sign() / Manifest.verify()
sign(issuer_ed25519_priv: bytes) computes the canonical bytes (excluding
signature fields), signs with Ed25519, and stores the hex signature in
signature_ed25519.
verify() -> bool checks the stored signature against the issuer's
public key and the canonical bytes.
Manifest.canonical_bytes() / to_json() / from_json()
canonical_bytes() returns the UTF-8 canonical JSON used as the signing
input (signature fields set to empty string, null values stripped, keys sorted).
to_json() returns the full manifest including signatures.
from_json(data: bytes) deserializes from JSON bytes, reconstructing
nested Recipient and WatermarkRef objects.
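Canonical JSON of the kind described above (sorted keys, no optional whitespace, nulls omitted, signature fields blanked) can be produced with the standard json module. This sketch works on a plain dict; the recursive null stripping and ensure_ascii=False are assumptions about details this reference does not spell out, and the function names are hypothetical.

```python
import json

SIGNATURE_FIELDS = ("signature_ed25519", "signature_ml_dsa")

def strip_nulls(value):
    """Recursively drop dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value

def canonical_bytes_sketch(manifest: dict) -> bytes:
    """UTF-8 JSON with sorted keys, compact separators, and blank signatures."""
    doc = dict(manifest)
    for field in SIGNATURE_FIELDS:
        doc[field] = ""
    text = json.dumps(strip_nulls(doc), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")

unsigned = {"b": 1, "a": None, "signature_ed25519": ""}
signed = {"signature_ed25519": "ff", "a": None, "b": 1}
canonical = canonical_bytes_sketch(unsigned)
```

Blanking the signature fields before serializing means the signed and unsigned forms of a manifest produce the same signing input, so verification can recompute it from the stored manifest.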
Recipient
| Field | Type | Description |
|---|---|---|
recipient_id | str | Stable identifier (email hash, user UUID, etc.). |
x25519_pub | str | Recipient's X25519 public key (hex). |
ed25519_pub | Optional[str] | Recipient's Ed25519 public key (hex), for verifying recipient acknowledgments. |
WatermarkRef
| Field | Type | Description |
|---|---|---|
layer | str | Layer identifier: "L1_zero_width", "L2_whitespace", or "L3_semantic". |
mark_id | str | Hex-encoded per-recipient mark identifier. |
crypto
The crypto module wraps vetted cryptographic primitives with no custom
constructions. It uses the cryptography library (OpenSSL backend) for
classical operations and oqs-python for post-quantum hooks (ML-KEM-768,
ML-DSA-65) when available.
ClassicIdentity
A dataclass holding an X25519 keypair (encryption) and an Ed25519 keypair (signing),
each as 32-byte raw values. ClassicIdentity.generate() creates a new
random identity. public_bundle() returns a dict with hex-encoded public
keys suitable for distribution to issuers.
content_hash()
| Parameter | Type | Description |
|---|---|---|
data | bytes | Raw content to hash. |
Returns: str, the SHA-256 hex digest. Used for the manifest's content_hash field and as AEAD additional data.
random_dek()
Returns: bytes (32), a cryptographically random document encryption key for XChaCha20-Poly1305.
aead_encrypt() / aead_decrypt()
aead_encrypt(key, plaintext, aad) encrypts with XChaCha20-Poly1305 using
a random 24-byte nonce, returning tuple[bytes, bytes] (nonce, ciphertext
with tag). aead_decrypt(key, nonce, ciphertext, aad) decrypts and verifies
the authentication tag. Raises an exception on tag mismatch.
wrap_dek_for_recipient() / unwrap_dek()
ECIES-style DEK wrapping. wrap_dek_for_recipient(dek, recipient_x25519_pub)
generates an ephemeral X25519 keypair, performs key agreement, derives a wrapping key
via HKDF-SHA256 with info string b"oversight-v1-dek-wrap", and encrypts
the DEK with XChaCha20-Poly1305. Returns a dict with hex-encoded
ephemeral_pub, nonce, and wrapped_dek.
unwrap_dek(wrapped, recipient_x25519_priv) reverses the operation.
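The key-derivation step can be illustrated with a standard-library HKDF-SHA256 written from RFC 5869. Only the info string comes from this reference; the empty salt, the 32-byte output length, and the placeholder shared secret are assumptions, and the X25519 agreement and XChaCha20-Poly1305 encryption are omitted because they need a third-party library.

```python
import hashlib
import hmac

def hkdf_sha256(ikm: bytes, info: bytes, length: int = 32, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) extract-then-expand with HMAC-SHA256."""
    if not salt:
        salt = b"\x00" * hashlib.sha256().digest_size  # RFC default for empty salt
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()   # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                             # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

# Placeholder standing in for the X25519 shared secret; not real key material.
shared_secret = bytes(range(32))
wrapping_key = hkdf_sha256(shared_secret, info=b"oversight-v1-dek-wrap")
hybrid_key = hkdf_sha256(shared_secret, info=b"oversight-hybrid-v1-dek-wrap")
```

Distinct info strings give domain separation: the same shared secret yields unrelated keys for the classic and hybrid wrapping paths.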
sign_manifest() / verify_manifest()
sign_manifest(manifest_bytes, ed25519_priv) returns the Ed25519 signature
(bytes). verify_manifest(manifest_bytes, signature, ed25519_pub) returns
bool.
Post-Quantum Functions
Available when oqs-python is installed (PQ_AVAILABLE = True):
pq_kem_keypair(), pq_kem_encap(),
pq_kem_decap() for ML-KEM-768, and
pq_sig_keypair(), pq_sign(),
pq_verify() for ML-DSA-65.
hybrid_wrap_dek() combines X25519 and ML-KEM-768 shared secrets
(X-wing-style) via HKDF with info string
b"oversight-hybrid-v1-dek-wrap".
This reference documents the Python API as of v0.4.5, including the
canonical_content_hash and l3_policy manifest fields and
the l3_policy module added in that release. The Rust implementation
provides equivalent functionality with native types. Consult the repository for the
latest interfaces.