This is a top-level issue to cover core deduplication support in the crawler.
The general idea is as follows:
## Core Requirements
### URL Revisit Record Support
- The crawler will support hash-based deduplication via Redis, which could be the same Redis as `--redisStoreUrl` (internal / external) or a secondary Redis instance (which can persist separately, with a different lifetime). This Redis will be specified via `--redisDedupUrl`.
- The Redis will store a `digest -> (original URL, original date)` mapping. The first time a WARC-Payload-Digest is encountered, the mapping is added to this Redis.
- When saving response records, after the WARC-Payload-Digest is computed, if an existing entry for the digest exists in the mapping, a revisit record is created instead (see the sketch below).
This allows for URL-agnostic deduplication across crawls.
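A minimal sketch of how the crawler could consult the dedup Redis when serializing a record, assuming an ioredis-style client; the env var name, `digest:<hash>` key layout, JSON value shape, and helper name are illustrative assumptions, not the final design:

```ts
import Redis from "ioredis";

// Dedup Redis specified via --redisDedupUrl; key layout and value shape are
// assumptions for illustration only.
const dedupRedis = new Redis(process.env.REDIS_DEDUP_URL ?? "redis://localhost:6379/1");

type DedupEntry = { url: string; date: string };

// Returns null if this digest is new (the mapping is stored and a normal
// response record is written), or the original entry if it already exists
// (a revisit record pointing at that URL/date is written instead).
async function checkOrAddDigest(
  digest: string,
  url: string,
  date: string,
): Promise<DedupEntry | null> {
  const key = `digest:${digest}`;
  // SET ... NX keeps the first capture as the "original", even with multiple
  // crawler instances writing concurrently.
  const added = await dedupRedis.set(key, JSON.stringify({ url, date }), "NX");
  if (added === "OK") {
    return null;
  }
  const existing = await dedupRedis.get(key);
  return existing ? (JSON.parse(existing) as DedupEntry) : null;
}
```

The caller would write a normal response record when `null` is returned, or a revisit record pointing at the returned URL and date otherwise.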
### Page Level Dedup / Skipping Support
- The digest can also be computed when the HTML is loaded as part of the Fetch handler. If the digest already exists (e.g. the HTML page is an exact duplicate), loading the whole page can be skipped by returning an error code to the browser. A revisit record can be written for the page, but no other resources are loaded. The page list should also record this as a `duplicate` page (see the sketch after this list).
- A `--minPageDedupDepth` setting can be added to only apply page dedup to pages at a minimum depth from the seed, e.g. the seed itself is never deduped, but pages at depth 2 or 3 can be.
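A rough sketch of the page-level skip decision, reusing the `checkOrAddDigest` helper from the sketch above; the digest algorithm, function names, and the `pageDepth` / `minPageDedupDepth` parameters are illustrative assumptions:

```ts
import { createHash } from "node:crypto";

// Helper from the previous sketch (declared here for completeness).
declare function checkOrAddDigest(
  digest: string,
  url: string,
  date: string,
): Promise<{ url: string; date: string } | null>;

// Decide whether a page load can be skipped as an exact duplicate.
async function shouldSkipDuplicatePage(
  url: string,
  html: Buffer,
  pageDepth: number,
  minPageDedupDepth: number,
): Promise<{ skip: boolean; origUrl?: string; origDate?: string }> {
  const digest = "sha256:" + createHash("sha256").update(html).digest("hex");
  const existing = await checkOrAddDigest(digest, url, new Date().toISOString());

  if (existing && pageDepth >= minPageDedupDepth) {
    // The Fetch handler would return an error code to the browser here so no
    // subresources are loaded, write a revisit record for the page, and mark
    // the page as "duplicate" in the page list.
    return { skip: true, origUrl: existing.url, origDate: existing.date };
  }
  return { skip: false };
}
```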
### Populating the dedup index
- It should be possible to populate a deduplication Redis from an existing set of WACZ files, a multi-WACZ file, or a multi-WACZ collection manifest. The crawler will have a special entry point that iterates over the WACZ files and processes all existing CDX(J) records to set up the `digest -> (original URL, original date)` mapping (see the sketch below).
- The state of the ingest should also be saved in the dedup Redis, so that the ingest can happen in parallel and can be retried until all the queued WACZ files are ingested.
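A sketch of the CDXJ ingest for a single WACZ index, assuming CDXJ lines of the form `<urlkey> <timestamp> <json>` with `url` and `digest` fields; the key names (`digest:`, `dedup:ingested`) and the ingest-state tracking are illustrative assumptions:

```ts
import * as readline from "node:readline";
import type { Readable } from "node:stream";
import Redis from "ioredis";

// The dedup Redis specified via --redisDedupUrl (same client as above).
declare const dedupRedis: Redis;

async function ingestCDXJ(waczName: string, cdxjStream: Readable): Promise<void> {
  // Skip WACZ files that were already fully ingested, so the overall ingest
  // can run in parallel and be retried until every queued WACZ is done.
  if (await dedupRedis.sismember("dedup:ingested", waczName)) {
    return;
  }

  const rl = readline.createInterface({ input: cdxjStream });

  for await (const line of rl) {
    const jsonStart = line.indexOf("{");
    if (jsonStart < 0) continue;

    const timestamp = line.slice(0, jsonStart).trim().split(" ").pop() ?? "";
    const entry = JSON.parse(line.slice(jsonStart));
    if (!entry.digest || !entry.url) continue;

    // First entry for a digest wins, matching the crawl-time behavior.
    await dedupRedis.set(
      `digest:${entry.digest}`,
      JSON.stringify({ url: entry.url, date: timestamp }),
      "NX",
    );
  }

  // Record that this WACZ's index has been fully processed.
  await dedupRedis.sadd("dedup:ingested", waczName);
}
```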
## Additional Requirements
These will likely be done through follow-up issues, but listing them here:
- Option to control revisit vs. skip within a single crawl. Currently, the browser already dedups URL+status within the same crawl. Perhaps this could be made more customizable, e.g. skip duplicate URLs for X minutes (by status, by hash?).
- Minimum dedup size: only write revisit records if the payload is larger than a minimum size.
- Fuzzy dedup of page HTML: apply a transformation regex to the page HTML to determine if it is a duplicate, and skip it if so (see the sketch below). Based on ideas from: https://github.com/natliblux/browsertrix-crawler-deduplication
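For the fuzzy page-HTML dedup idea, one possible shape with placeholder regexes; the linked natliblux project drives the actual transformations, so these patterns are purely illustrative:

```ts
import { createHash } from "node:crypto";

// Placeholder patterns for stripping volatile parts of the HTML before
// hashing, so near-identical pages produce the same digest.
const FUZZY_STRIP_PATTERNS: RegExp[] = [
  /<script[\s\S]*?<\/script>/gi,          // inline scripts (nonces, timestamps)
  /<!--[\s\S]*?-->/g,                     // HTML comments
  /\b(csrf|nonce|session)=[^"'&\s]+/gi,   // volatile tokens in URLs/attributes
];

function fuzzyPageDigest(html: string): string {
  let normalized = html;
  for (const re of FUZZY_STRIP_PATTERNS) {
    normalized = normalized.replace(re, "");
  }
  return "sha256:" + createHash("sha256").update(normalized).digest("hex");
}
```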