Option to ignore further seeds from a host that has connection failures #879

@Mr0grog

This is probably a bit of an edge case, but I have a crawl with seeds spread across a lot of hostnames, and many seeds at each hostname. Sometimes a server goes offline, which makes the crawl really slow: for each seed at that hostname, Browsertrix can take a long time to give up connecting to the server, and then it retries a few times as well.

That’s all totally reasonable and expected, but it would be really nice if I could configure Browsertrix to do two things. First, delay crawling the other seeds at a given hostname if the first seed at that hostname fails with a connection error (DNS resolution or connection timeout, I think). Second, discard those seeds entirely, so it never bothers trying to load them, if the first seed fails completely after its retries are exhausted. In the vast majority of cases that seems like reasonable and generally correct behavior, and it would avoid wasting a lot of time and resources on pointless requests.
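To make the requested behavior concrete, here is a rough sketch of the per-hostname bookkeeping I have in mind. Everything in it is hypothetical: `HostFailureTracker`, `recordConnectionFailure`, `recordSuccess`, and `decide` are names made up for illustration, not part of Browsertrix, and how the crawler would detect a connection-level error or requeue a deferred seed is left out entirely.

```typescript
// Hypothetical sketch only: none of these names exist in Browsertrix.
// "suspect": a seed on this host hit a connection error, retries still pending.
// "dead": a seed on this host failed with a connection error after all retries.
type HostState = "suspect" | "dead";
type SeedDecision = "crawl" | "defer" | "discard";

class HostFailureTracker {
  private states = new Map<string, HostState>();

  private static hostOf(seedUrl: string): string {
    return new URL(seedUrl).hostname;
  }

  // Call when a seed fails with a connection-level error
  // (assumed here to mean DNS resolution failure or connection timeout).
  recordConnectionFailure(seedUrl: string, retriesExhausted: boolean): void {
    const host = HostFailureTracker.hostOf(seedUrl);
    if (this.states.get(host) === "dead") {
      return; // a host that has been given up on stays that way
    }
    this.states.set(host, retriesExhausted ? "dead" : "suspect");
  }

  // Call when a seed on the host connects successfully (for example on a retry),
  // so that deferred seeds for that host become crawlable again.
  recordSuccess(seedUrl: string): void {
    const host = HostFailureTracker.hostOf(seedUrl);
    if (this.states.get(host) === "suspect") {
      this.states.delete(host);
    }
  }

  // Decide what to do with a seed before trying to load it.
  decide(seedUrl: string): SeedDecision {
    const state = this.states.get(HostFailureTracker.hostOf(seedUrl));
    if (state === "dead") return "discard";
    if (state === "suspect") return "defer";
    return "crawl";
  }
}
```

With something like this, the crawler would call `decide()` before loading each seed: "defer" would push the seed back in the queue while the first seed's retries play out, and "discard" would drop it without making a request. Seeds on other hostnames are unaffected.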