Deduplication (Initial Support) #884

@ikreymer

Description

This is a top-level issue to cover core deduplication support in the crawler.

The general idea is as follows:

Core Requirements

URL Revisit Record Support:

  • The crawler will support hash-based deduplication via Redis, using either the same Redis instance as --redisStoreUrl (internal or external) or a secondary Redis instance that can persist independently with a different lifetime. This Redis instance will be specified via --redisDedupUrl.
  • The Redis will store a digest -> (original URL, original date) mapping. The first time a WARC-Payload-Digest is encountered, the mapping is added to this Redis.
  • When saving response records, after the WARC-Payload-Digest is computed, if an entry for the digest already exists in the mapping, a revisit record is created instead.

This allows for URL-agnostic deduplication across crawls.
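
A minimal sketch of this lookup flow, assuming ioredis for the Redis client; the key layout and the addDedupEntry / getDedupEntry / writeRevisitRecord names are illustrative, not existing crawler APIs:

```ts
import Redis from "ioredis";

// Hypothetical key layout: one hash per payload digest, storing the
// original URL and date of the first capture. Names are illustrative only.
const dedupRedis = new Redis(process.env.REDIS_DEDUP_URL || "redis://localhost:6379/1");

type DedupEntry = { url: string; date: string };

// Record the first capture of a digest; returns false if it already existed.
async function addDedupEntry(digest: string, url: string, date: string): Promise<boolean> {
  // HSETNX: only the first writer for a given digest wins.
  const added = await dedupRedis.hsetnx(`dedup:${digest}`, "url", url);
  if (added) {
    await dedupRedis.hset(`dedup:${digest}`, "date", date);
  }
  return added === 1;
}

async function getDedupEntry(digest: string): Promise<DedupEntry | null> {
  const entry = await dedupRedis.hgetall(`dedup:${digest}`);
  return entry && entry.url ? { url: entry.url, date: entry.date } : null;
}

// When writing a response record: if the digest was seen before, write a
// revisit record pointing at the original URL/date instead of a full response.
async function writeResponseOrRevisit(digest: string, url: string, date: string) {
  const original = await getDedupEntry(digest);
  if (original) {
    // placeholder for the crawler's WARC writer:
    // writeRevisitRecord(url, date, original.url, original.date, digest);
  } else {
    await addDedupEntry(digest, url, date);
    // writeResponseRecord(url, date, digest, payload);
  }
}
```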

Page Level Dedup / Skipping Support:

  • The digest can also be computed when the HTML is loaded as part of the Fetch handler. If the digest already exists (e.g. the HTML page is an exact duplicate), loading the whole page can be skipped by returning an error code to the browser. A revisit record can still be written for the page, but no other resources are loaded. The page list should also record this as a duplicate page.
  • A --minPageDedupDepth setting can be added to apply page-level dedup only at or beyond a minimum depth from the seed, e.g. the seed itself is never deduped, but pages at depth 2 or 3 can be (see the sketch after this list).
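
A rough sketch of how the depth gate and page-digest check might fit together; shouldSkipDuplicatePage, the digest format, and the handler wiring are all assumptions here, and getDedupEntry is the helper sketched above:

```ts
import { createHash } from "crypto";

// Sketch of a page-level dedup check inside the Fetch handler.
// The digest format and helper names are illustrative only;
// getDedupEntry is the lookup helper sketched earlier.
const minPageDedupDepth = 2; // would come from --minPageDedupDepth

async function shouldSkipDuplicatePage(pageDepth: number, html: Buffer): Promise<string | null> {
  if (pageDepth < minPageDedupDepth) {
    return null; // never dedup the seed or shallow pages
  }
  // digest of the HTML payload (format shown here is illustrative)
  const digest = "sha-256:" + createHash("sha256").update(html).digest("hex");
  const original = await getDedupEntry(digest);
  return original ? digest : null;
}

// In the Fetch handler (pseudocode):
//   const dupDigest = await shouldSkipDuplicatePage(page.depth, htmlBody);
//   if (dupDigest) {
//     // abort the navigation with an error status so no subresources load,
//     // write a revisit record for the page, and flag it in the page list
//   }
```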

Populating the dedup index:

  • It should be possible to populate a deduplication Redis from an existing set of WACZ files, a multi-WACZ file, or a multi-WACZ collection manifest. The crawler will have a special entry point that iterates over the WACZ files and processes all existing CDX(J) records to set up the digest -> (original URL, original date) mapping (see the sketch after this list).
  • The state of the ingest should also be saved in the dedup Redis, so that the ingest can happen in parallel and can be retried until all the queued WACZ files are ingested.
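
A sketch of the CDX(J) ingest loop, under the assumption that each CDXJ line has the form "urlkey timestamp {json}" with url and digest fields; WACZ unpacking and the ingest-state bookkeeping are only hinted at in comments, and addDedupEntry is the helper sketched earlier:

```ts
import { createInterface } from "readline";
import { Readable } from "stream";

// Ingest a CDXJ stream into the dedup index. CDXJ lines are assumed to look
// like: "<urlkey> <timestamp> {json}", where the JSON block has "url" and
// "digest" fields. Exact field names and the WACZ unpacking step are
// assumptions; addDedupEntry is the helper sketched earlier.
async function ingestCdxj(cdxjStream: Readable): Promise<void> {
  const rl = createInterface({ input: cdxjStream, crlfDelay: Infinity });
  for await (const line of rl) {
    const jsonStart = line.indexOf("{");
    if (jsonStart < 0) continue;
    const timestamp = line.split(" ")[1];
    let entry;
    try {
      entry = JSON.parse(line.slice(jsonStart));
    } catch {
      continue; // skip malformed lines
    }
    if (entry.digest && entry.url) {
      await addDedupEntry(entry.digest, entry.url, timestamp);
    }
  }
}

// Ingest state (e.g. a Redis set of WACZ files already processed) would let
// multiple workers share the queue and retry failed ingests, for example:
//   await dedupRedis.sadd("dedup:ingest:done", waczFilename);
```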

Additional Requirements

These will likely be done through follow-up issues, but listing them here:

  • Option to control revisit vs. skip within a single crawl. Currently, the browser already dedups by URL + status within the same crawl. Perhaps this could be made more customizable, e.g. skip duplicate URLs for X minutes (by status, by hash?).
  • Minimum dedup size: only write revisit records if the payload is larger than a minimum size.
  • Fuzzy dedup of page HTML: apply a transformation regex to the page HTML to determine whether it is a duplicate, and skip it if so (see the sketch below). Based on ideas from: https://github.com/natliblux/browsertrix-crawler-deduplication
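
For the fuzzy dedup idea, a sketch of how transformation regexes could normalize the HTML before hashing; the specific regexes, the fuzzyDigest name, and the digest format are illustrative only:

```ts
import { createHash } from "crypto";

// Sketch of fuzzy page dedup: strip volatile content with transformation
// regexes before hashing, so pages that differ only in timestamps, nonces,
// etc. hash to the same digest. The regexes below are purely illustrative.
const fuzzyTransforms: [RegExp, string][] = [
  [/<script[\s\S]*?<\/script>/gi, ""],          // drop inline scripts
  [/ nonce="[^"]*"/gi, ""],                      // drop CSP nonces
  [/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<date>"],    // normalize ISO timestamps
];

function fuzzyDigest(html: string): string {
  let normalized = html;
  for (const [pattern, replacement] of fuzzyTransforms) {
    normalized = normalized.replace(pattern, replacement);
  }
  return "sha-256:" + createHash("sha256").update(normalized).digest("hex");
}

// A page would then be treated as a duplicate if fuzzyDigest(html) already
// exists in the dedup index, even when the raw payload digest differs.
```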
