[YSQL] Backfill fails on errors that are technically retryable #28680

@jasonyb

Jira Link: DB-18379

Description

BackfillChunk errors are classified as either retryable or nonretryable. A nonretryable error aborts the entire backfill, so the index is never fully created. The classification is largely implemented as a whitelist of retryable errors; everything else is treated as nonretryable.

We have seen backfills fail on rare, unexpected errors such as PerformRPC timed out. Many transient errors like this are in fact retryable. It is probably better to switch from a whitelist of retryable errors to a blacklist of nonretryable errors, such as a duplicate key error on a unique index or a failure to establish a connection to PG. Such a list may be difficult to construct, but it is more frustrating for a backfill to fail on a flaky issue than for it to succeed slowly after many retries.
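To make the difference concrete, here is a minimal Python sketch of the two classification strategies. It is illustrative only and is not the YugabyteDB implementation: the error labels are hypothetical strings standing in for real status codes, and the whitelist contents in particular are made up.

```python
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retryable"
    NONRETRYABLE = "nonretryable"


# Hypothetical labels, for illustration only (not real YugabyteDB status codes).
RETRYABLE_WHITELIST = {"TRY_AGAIN", "LEADER_NOT_READY"}
NONRETRYABLE_BLACKLIST = {"DUPLICATE_KEY", "PG_CONNECTION_FAILED"}


def classify_with_whitelist(error: str) -> ErrorClass:
    """Current approach: only known-retryable errors are retried."""
    if error in RETRYABLE_WHITELIST:
        return ErrorClass.RETRYABLE
    return ErrorClass.NONRETRYABLE


def classify_with_blacklist(error: str) -> ErrorClass:
    """Proposed approach: only known-fatal errors abort the backfill."""
    if error in NONRETRYABLE_BLACKLIST:
        return ErrorClass.NONRETRYABLE
    return ErrorClass.RETRYABLE
```

Under the whitelist, an unlisted error such as a hypothetical `"PERFORM_RPC_TIMED_OUT"` is classified as nonretryable and aborts the backfill. Under the blacklist, the same error is retried, while `"DUPLICATE_KEY"` still fails fast.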

Also, the BackfillChunk max retries is set to 150, which is too high. It should be lowered to, say, 15 for retryable errors and 5 for nonretryable errors. Allowing retries for nonretryable errors sounds counterintuitive, but it makes sense under the current approach, where anything not on the retryable whitelist is treated as nonretryable. In that case, "nonretryable" is effectively synonymous with "unclassified".
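A rough Python sketch of the proposed retry budgets follows. It is not the actual BackfillChunk code: `run_chunk_with_retries`, `attempt`, and `is_retryable` are hypothetical names, the two budgets are assumed to be counted separately, and backoff between attempts is omitted for brevity.

```python
from typing import Callable, Optional

# Proposed budgets from this issue (the current single limit is 150).
MAX_RETRIES_RETRYABLE = 15
MAX_RETRIES_UNCLASSIFIED = 5


def run_chunk_with_retries(
    attempt: Callable[[], Optional[str]],
    is_retryable: Callable[[str], bool],
) -> bool:
    """Run one chunk until it succeeds or a retry budget is exhausted.

    `attempt` returns None on success or an error label on failure.
    Returns True on success, False once either budget is used up.
    """
    retryable_failures = 0
    unclassified_failures = 0
    while True:
        error = attempt()
        if error is None:
            return True
        if is_retryable(error):
            retryable_failures += 1
            # The initial attempt plus 15 retries have all failed.
            if retryable_failures > MAX_RETRIES_RETRYABLE:
                return False
        else:
            unclassified_failures += 1
            # The initial attempt plus 5 retries have all failed.
            if unclassified_failures > MAX_RETRIES_UNCLASSIFIED:
                return False
```

With these numbers, a chunk that only ever hits unclassified errors gives up after 6 attempts, and one that only ever hits retryable errors gives up after 16. Whether the two budgets should be tracked separately or share one counter is an open design choice that the issue does not specify.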

Issue Type

kind/enhancement

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
