Detect identical canonical URLs returned by multiple providers or repeated searches.
A repeated URL is one source, not multiple confirmations.
Repetition is not verification.
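As a minimal sketch of this rule, the snippet below groups search hits by URL so that a URL returned several times is counted once. It is illustrative only and is not TheoB's implementation: the `(provider, url)` tuple shape, the provider names, and the function name `count_distinct_sources` are all assumptions, and the URLs are assumed to be canonical already.

```python
from collections import defaultdict


def count_distinct_sources(hits):
    """Group (provider, url) hits by URL; each URL is one source."""
    providers_by_url = defaultdict(set)
    for provider, url in hits:
        # Remember which providers returned the URL instead of discarding them.
        providers_by_url[url].add(provider)
    return dict(providers_by_url)


# Hypothetical hits: the same URL comes back from two providers and a repeat search.
hits = [
    ("provider_a", "https://example.org/report"),
    ("provider_b", "https://example.org/report"),
    ("provider_a", "https://example.org/report"),
    ("provider_c", "https://example.net/analysis"),
]
sources = count_distinct_sources(hits)
print(len(hits), "hits,", len(sources), "distinct sources")
```

The provider set is kept per URL, so the count collapses while the trail of who returned what stays visible.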
A read-only deduplication readiness layer that defines how TheoB will separate copied repetition from independent corroboration before scoring, Vault ingestion, or capsule compression.
Ten copies of one claim are still one claim.
Discovery Deduplication Readiness prepares TheoB to cluster exact URLs, canonical variants, syndicated content, repeated claims, datasets, images, CAD, schematics, and visual near-duplicates without erasing source trails.
Deduplication readiness is active as a non-destructive policy layer. TheoB can define duplicate clustering logic, but it cannot process live provider results, delete sources, write Vault records, or compress capsules yet.
Cluster URLs that differ only by tracking parameters, mobile versions, trailing slashes, fragments, or protocol variants.
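One way to picture these variants is a small normalizer built on Python's standard `urllib.parse`. This is a sketch under stated assumptions, not TheoB's actual rules: the tracking keys, the `m.` and `mobile.` host prefixes, the choice of `https` as the common scheme, and the function name `canonicalize` are all hypothetical. It returns the rules it applied so the grouping can be explained later.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical noise lists; a real policy would need review per site.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}
MOBILE_HOST_PREFIXES = ("m.", "mobile.")


def canonicalize(url):
    """Return (canonical_url, applied_rules) for clustering purposes."""
    parts = urlsplit(url)
    applied = []

    scheme = parts.scheme.lower()
    if scheme == "http":
        scheme = "https"  # assumption: treat http/https as protocol variants
        applied.append("protocol")

    host = (parts.hostname or "").lower()  # note: any userinfo is dropped
    for prefix in MOBILE_HOST_PREFIXES:
        if host.startswith(prefix):
            host = host[len(prefix):]
            applied.append("mobile_host")
            break
    if parts.port not in (None, 80, 443):
        host = f"{host}:{parts.port}"  # a non-default port is part of identity

    path = parts.path.rstrip("/")
    if path != parts.path:
        applied.append("trailing_slash")

    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs
            if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)]
    if len(kept) < len(pairs):
        applied.append("tracking_params")

    if parts.fragment:
        applied.append("fragment")

    return urlunsplit((scheme, host, path, urlencode(kept), "")), applied


noisy, rules = canonicalize(
    "http://m.example.org/news/item/?utm_source=feed&id=7#top")
clean, _ = canonicalize("https://example.org/news/item?id=7")
print(noisy == clean, rules)
```

Non-tracking parameters such as `id` are kept in their original order, because they may carry page identity. Fragments and mobile hosts are not always noise, so each rule here would need exceptions in practice.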
Strip noise carefully without destroying meaningful URL identity.
Detect when the same article or release appears across multiple publishers, mirrors, or syndication networks.
Syndication should not be mistaken for independent corroboration.
Detect pages that are not identical but substantially repeat the same language, summary, or claim structure.
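A common way to score this kind of overlap is to compare word shingles with Jaccard similarity. The sketch below uses that technique as an illustration; the source does not say which method TheoB will use, and the function names, the shingle size of 3, and the sample sentences are assumptions.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicate_score(text_a, text_b, size=3):
    """Score from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(text_a, size), shingles(text_b, size))


original = "The agency confirmed the outage affected three regions on Monday."
rewrite = ("The agency confirmed the outage affected three regions on Monday, "
           "officials said.")
score = near_duplicate_score(original, rewrite)
print(round(score, 2))
```

The score is a signal rather than a verdict: a high value suggests the second page should add little independence, while both records remain in place.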
Near duplicates should lower independence, not erase all records.
Cluster sources repeating the same claim even when the article, page, or file is different.
Claims need independent-source context before confidence increases.
Detect when datasets represent different versions, revisions, snapshots, or mirrors of the same underlying data.
Dataset revisions must preserve version, timestamp, schema, and source provenance.
Detect reused images, resized images, cropped images, screenshots, and visual near-duplicates.
Visual similarity must preserve uncertainty and source trail.
Detect reused diagrams, CAD exports, schematics, floor plans, maps, and engineering visual variants.
Schematic differences may be meaningful and cannot be flattened blindly.
TheoB must cluster repeated sources before scoring confidence.
Repetition is not verification.
Every duplicate cluster must preserve all member sources and original provider trails.
Deduplication must compress noise, not erase provenance.
URL normalization and duplicate clustering must show why records were grouped.
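The two requirements above (keep every member, show why it was grouped) can be sketched as a small cluster structure. Everything here is hypothetical and for illustration only: `SourceRecord`, `DuplicateCluster`, `cluster_records`, and the `strip_fragment` key function are invented names, not part of TheoB.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SourceRecord:
    provider: str  # which provider returned the record
    url: str       # the URL exactly as it was returned


@dataclass
class DuplicateCluster:
    key: str
    # Each entry pairs an untouched record with the reason it joined.
    entries: list = field(default_factory=list)

    @property
    def members(self):
        return [record for record, _ in self.entries]


def cluster_records(records, key_fn):
    """Group records by key_fn, which returns (cluster_key, reason)."""
    clusters = {}
    for record in records:
        key, reason = key_fn(record)
        cluster = clusters.setdefault(key, DuplicateCluster(key=key))
        cluster.entries.append((record, reason))  # nothing is deleted
    return clusters


def strip_fragment(record):
    """Toy key function: ignore the #fragment and say so."""
    base, sep, _ = record.url.partition("#")
    return base, ("fragment removed" if sep else "url unchanged")


records = [
    SourceRecord("provider_a", "https://example.org/a#intro"),
    SourceRecord("provider_b", "https://example.org/a"),
    SourceRecord("provider_a", "https://example.org/b"),
]
clusters = cluster_records(records, strip_fragment)
for cluster in clusters.values():
    print(cluster.key, [(r.provider, why) for r, why in cluster.entries])
```

Because the original records are stored unchanged alongside the reason, a cluster could be audited or split again without information loss.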
No invisible source merging.
TheoB must distinguish copied repetition from independent support.
A copied article is not the same as a second witness.
This layer defines readiness only and does not process live provider results.
No provider queries, Vault writes, or automatic source changes.
Duplicates may be clustered, but records should not be deleted automatically.
Keep auditability and reversibility.
Duplicate claims should eventually connect to source scoring and conflict detection.
A repeated claim can still be wrong.
Images, diagrams, maps, schematics, CAD, and datasets require specialized duplicate logic.
Do not use text-only duplicate logic for visual or structured files.
Intelligence Capsules should be built from deduplicated source clusters, not raw repeated noise.
A capsule must preserve enough truth to be reawakened faithfully.