Track seen URLs at all the domains we monitor #173

@Mr0grog

Description

We’d like to know when new pages are added to or old pages removed from agency websites, but finding that info is hard! At the moment, it’s impractical to do a full crawl of all the websites we monitor (that would be like half of End-of-Term Archive’s job!). There might be narrower ways to do regular crawls along those lines, but that work is both high-effort and highly speculative.

On the other hand, we do have a nice source of rough data: all the links on the several thousand pages we already track regularly. We should build an index of all known page URLs at sites we monitor, based on the links in the current versions of all tracked pages. We can then re-run that calculation and compare the two lists of links whenever a new capture is imported, or just on a regular basis (maybe as part of generating weekly task sheets, when we are already looking at every changed page’s links). This will obviously produce a lot of false positives (new links to already-existing pages) and miss lots of new pages (ones that aren’t directly linked from pages we monitor), but it’s still likely to be pretty good, and at the very least much simpler to experiment with than trying to crawl entire sites.
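A minimal sketch of the idea, using only the standard library. The function and variable names (`known_urls`, `diff_indexes`, `MONITORED_DOMAINS`) are hypothetical, and the real implementation would pull pages and domains from our database rather than in-memory pairs; this just shows the index-then-diff shape:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

# Hypothetical stand-in; the real domain list would come from our data.
MONITORED_DOMAINS = {"epa.gov"}


class LinkExtractor(HTMLParser):
    """Collect absolute, fragment-free URLs from every <a href> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.add(absolute)


def known_urls(pages):
    """Build the index: every monitored-domain URL linked from any tracked page.

    `pages` is an iterable of (page_url, html_body) pairs, i.e. the current
    version of each page we monitor.
    """
    index = set()
    for page_url, body in pages:
        parser = LinkExtractor(page_url)
        parser.feed(body)
        for link in parser.links:
            host = link.split("/")[2] if "//" in link else ""
            if any(host == d or host.endswith("." + d) for d in MONITORED_DOMAINS):
                index.add(link)
    return index


def diff_indexes(old, new):
    """Compare two runs of the index: candidate new pages and removed pages."""
    return {"added": new - old, "removed": old - new}
```

Re-running `known_urls` on each import (or weekly) and feeding the two sets to `diff_indexes` yields the candidate added/removed URL lists described above; "added" entries are only *candidates* because a new link may point at a page that already existed.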

See also https://edgi.slack.com/archives/CFA6LE5GX/p1738883150312579
