Deduplication (Initial Support) #884

@ikreymer

Description

This is a top-level issue to cover core deduplication support in the crawler.

The general idea is as follows:

Core Requirements

URL Revisit Record Support:

  • The crawler will support hash-based deduplication via Redis, using either the same Redis instance as --redisStoreUrl (internal or external) or a secondary Redis instance that can persist independently with a different lifetime. This Redis instance will be specified via --redisDedupUrl.
  • The Redis will store a digest -> (original URL, original date) mapping. The first time a WARC-Payload-Digest is encountered, the mapping is added to this Redis.
  • When saving response records, after the WARC-Payload-Digest is computed, if an entry for the digest already exists in the mapping, a revisit record is created instead.

This allows for URL-agnostic deduplication across crawls.
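
A minimal sketch of this lookup flow, assuming ioredis for the Redis client; the key layout and the addDedupEntry / getDedupEntry / writeRevisitRecord names are illustrative, not existing crawler APIs:

```ts
import Redis from "ioredis";

// Hypothetical key layout: one hash per payload digest, storing the
// original URL and date of the first capture. Names are illustrative only.
const dedupRedis = new Redis(process.env.REDIS_DEDUP_URL || "redis://localhost:6379/1");

type DedupEntry = { url: string; date: string };

// Record the first capture of a digest; returns false if it already existed.
async function addDedupEntry(digest: string, url: string, date: string): Promise<boolean> {
  // HSETNX: only the first writer for a given digest wins.
  const added = await dedupRedis.hsetnx(`dedup:${digest}`, "url", url);
  if (added) {
    await dedupRedis.hset(`dedup:${digest}`, "date", date);
  }
  return added === 1;
}

async function getDedupEntry(digest: string): Promise<DedupEntry | null> {
  const entry = await dedupRedis.hgetall(`dedup:${digest}`);
  return entry && entry.url ? { url: entry.url, date: entry.date } : null;
}

// When writing a response record: if the digest was seen before, write a
// revisit record pointing at the original URL/date instead of a full response.
async function writeResponseOrRevisit(digest: string, url: string, date: string) {
  const original = await getDedupEntry(digest);
  if (original) {
    // placeholder for the crawler's WARC writer:
    // writeRevisitRecord(url, date, original.url, original.date, digest);
  } else {
    await addDedupEntry(digest, url, date);
    // writeResponseRecord(url, date, digest, payload);
  }
}
```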

Page Level Dedup / Skipping Support:

  • The digest can also be computed when the HTML is loaded as part of the Fetch handler. If the digest already exists (e.g. the HTML page is an exact duplicate), loading the whole page can be skipped by returning an error code to the browser. A revisit record can still be written for the page, but no other resources are loaded. The page list should also record this as a duplicate page.
  • A --minPageDedupDepth setting can be added to apply page-level dedup only at or beyond a minimum depth from the seed, e.g. the seed itself is never deduped, but pages at depth 2 or 3 can be (see the sketch after this list).
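
A rough sketch of how the depth gate and page-digest check might fit together; shouldSkipDuplicatePage, the digest format, and the handler wiring are all assumptions here, and getDedupEntry is the helper sketched above:

```ts
import { createHash } from "crypto";

// Sketch of a page-level dedup check inside the Fetch handler.
// The digest format and helper names are illustrative only;
// getDedupEntry is the lookup helper sketched earlier.
const minPageDedupDepth = 2; // would come from --minPageDedupDepth

async function shouldSkipDuplicatePage(pageDepth: number, html: Buffer): Promise<string | null> {
  if (pageDepth < minPageDedupDepth) {
    return null; // never dedup the seed or shallow pages
  }
  // digest of the HTML payload (format shown here is illustrative)
  const digest = "sha-256:" + createHash("sha256").update(html).digest("hex");
  const original = await getDedupEntry(digest);
  return original ? digest : null;
}

// In the Fetch handler (pseudocode):
//   const dupDigest = await shouldSkipDuplicatePage(page.depth, htmlBody);
//   if (dupDigest) {
//     // abort the navigation with an error status so no subresources load,
//     // write a revisit record for the page, and flag it in the page list
//   }
```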

Populating the dedup index:

  • It should be possible to populate a deduplication Redis from an existing set of WACZ files, a multi-WACZ file, or a multi-WACZ collection manifest. The crawler will have a special entry point that iterates over the WACZ files and processes all existing CDX(J) records to set up the digest -> (original URL, original date) mapping (see the sketch after this list).
  • The state of the ingest should also be saved in the dedup Redis, so that the ingest can happen in parallel and can be retried until all the queued WACZ files are ingested.
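
A sketch of the CDX(J) ingest loop, under the assumption that each CDXJ line has the form "urlkey timestamp {json}" with url and digest fields; WACZ unpacking and the ingest-state bookkeeping are only hinted at in comments, and addDedupEntry is the helper sketched earlier:

```ts
import { createInterface } from "readline";
import { Readable } from "stream";

// Ingest a CDXJ stream into the dedup index. CDXJ lines are assumed to look
// like: "<urlkey> <timestamp> {json}", where the JSON block has "url" and
// "digest" fields. Exact field names and the WACZ unpacking step are
// assumptions; addDedupEntry is the helper sketched earlier.
async function ingestCdxj(cdxjStream: Readable): Promise<void> {
  const rl = createInterface({ input: cdxjStream, crlfDelay: Infinity });
  for await (const line of rl) {
    const jsonStart = line.indexOf("{");
    if (jsonStart < 0) continue;
    const timestamp = line.split(" ")[1];
    let entry;
    try {
      entry = JSON.parse(line.slice(jsonStart));
    } catch {
      continue; // skip malformed lines
    }
    if (entry.digest && entry.url) {
      await addDedupEntry(entry.digest, entry.url, timestamp);
    }
  }
}

// Ingest state (e.g. a Redis set of WACZ files already processed) would let
// multiple workers share the queue and retry failed ingests, for example:
//   await dedupRedis.sadd("dedup:ingest:done", waczFilename);
```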

Additional Requirements

These will likely be done through follow-up issues, but listing them here:

  • Option to control revisit vs. skip within a single crawl. Currently, the browser already dedups by URL + status within the same crawl. Perhaps this could be made more customizable, e.g. skip duplicate URLs for X minutes (by status, by hash?).
  • Minimum dedup size: only write revisit records if the payload is larger than a minimum size.
  • Fuzzy dedup of page HTML: apply a transformation regex to the page HTML to determine whether it is a duplicate, and skip it if so (see the sketch below). Based on ideas from: https://github.com/natliblux/browsertrix-crawler-deduplication
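
For the fuzzy dedup idea, a sketch of how transformation regexes could normalize the HTML before hashing; the specific regexes, the fuzzyDigest name, and the digest format are illustrative only:

```ts
import { createHash } from "crypto";

// Sketch of fuzzy page dedup: strip volatile content with transformation
// regexes before hashing, so pages that differ only in timestamps, nonces,
// etc. hash to the same digest. The regexes below are purely illustrative.
const fuzzyTransforms: [RegExp, string][] = [
  [/<script[\s\S]*?<\/script>/gi, ""],          // drop inline scripts
  [/ nonce="[^"]*"/gi, ""],                      // drop CSP nonces
  [/\d{4}-\d{2}-\d{2}T[\d:.]+Z?/g, "<date>"],    // normalize ISO timestamps
];

function fuzzyDigest(html: string): string {
  let normalized = html;
  for (const [pattern, replacement] of fuzzyTransforms) {
    normalized = normalized.replace(pattern, replacement);
  }
  return "sha-256:" + createHash("sha256").update(normalized).digest("hex");
}

// A page would then be treated as a duplicate if fuzzyDigest(html) already
// exists in the dedup index, even when the raw payload digest differs.
```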
